AAAI · Technical Track · 2026 · 8 min read

ENCORE: downweight the confused rules, keep the confident ones.

Multi-head safety reward models judge a response against a list of rules — "is it toxic?", "is it factual?", "is it harmful?". Some rules are ambiguous; their scores are noisy. We showed that rules with high rating entropy hurt accuracy, and the simple fix — entropy-weighted composition — beats the state of the art without any retraining.
Fig. 1 diagram: BEFORE, an equal-weight mean over rules A–D lets the high-entropy rules C and D make the aggregate noisy; AFTER, entropy weights (A 0.90, B 0.82, C 0.15, D 0.08) give a trustable weighted sum. The only change is the weights. No retraining. +4.5 pts on RewardBench-safety.
Fig. 1 — Rules with high rating entropy (noisy judgments) get downweighted. Confident rules dominate the aggregate.

Safety alignment uses reward models to score model outputs against a catalogue of rules. A single "is this safe?" head is coarse; the trend has been to split it into many heads — "is it toxic?", "is it harmful?", "is it factual?" — and aggregate. The natural aggregation is the mean, or a learned linear combination. We found that both of those waste information, and we found a simple fix.

The observation that started it

Looking at per-rule accuracy on RewardBench, we noticed a strong negative correlation between a rule's rating entropy (how dispersed its scores are over a fixed set of responses) and its downstream accuracy. Rules with crisp, low-entropy judgments were the ones that separated preferred from dispreferred responses; rules whose judgments were all over the place contributed roughly nothing.
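The diagnostic is easy to reproduce on synthetic data. The sketch below uses toy Gaussian rule scores (not the paper's data; `rating_entropy` is an assumed histogram estimator): rules with more score noise should show higher rating entropy and lower pairwise accuracy, so the two quantities should be negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs, n_rules = 500, 6

# Toy setup: each rule scores a preferred and a dispreferred response.
# A small true gap plus per-rule noise of increasing size; scores bounded in [0, 1].
noise = np.linspace(0.05, 0.6, n_rules)   # per-rule noise level
signal = 0.15                              # true preferred-minus-dispreferred gap
pref = np.clip(0.5 + signal / 2 + rng.normal(0, noise, (n_pairs, n_rules)), 0, 1)
disp = np.clip(0.5 - signal / 2 + rng.normal(0, noise, (n_pairs, n_rules)), 0, 1)

# Per-rule pairwise accuracy: how often the rule ranks the preferred response higher.
accuracy = (pref > disp).mean(axis=0)

def rating_entropy(x, bins=10):
    """Normalised entropy of a rule's score histogram over [0, 1]."""
    counts, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(bins)

entropy = np.array(
    [rating_entropy(np.r_[pref[:, j], disp[:, j]]) for j in range(n_rules)]
)

# The diagnostic: rating entropy and accuracy are negatively correlated.
corr = np.corrcoef(entropy, accuracy)[0, 1]
```

In this toy model the correlation comes out strongly negative, mirroring the pattern described above; the real analysis runs the same computation on actual reward-model heads.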

If noisy rules are noise, you should downweight them. The weight that falls out of a Bradley–Terry analysis of this setup is almost exactly 1 − H(rule), where H is the normalised entropy of the rule's scores over a calibration set. That is what ENCORE does.
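A minimal sketch of that weighting, assuming scores bounded in [0, 1] and a histogram entropy estimator (the paper's exact estimator and binning may differ):

```python
import numpy as np

def encore_weights(scores, n_bins=10):
    """Per-rule weights w_k = 1 - H_k, where H_k is the normalised
    entropy of rule k's scores over a calibration set.

    scores: (n_responses, n_rules) array of rule scores in [0, 1].
    """
    _, n_rules = scores.shape
    weights = np.empty(n_rules)
    for j in range(n_rules):
        # Histogram the rule's scores and take the entropy of the bin distribution.
        counts, _ = np.histogram(scores[:, j], bins=n_bins, range=(0.0, 1.0))
        p = counts / counts.sum()
        p = p[p > 0]
        h = -(p * np.log(p)).sum() / np.log(n_bins)  # normalised to [0, 1]
        weights[j] = 1.0 - h
    return weights

def aggregate(scores, weights):
    """Weighted aggregate reward per response (replaces the plain mean)."""
    return scores @ weights / weights.sum()
```

A crisp rule whose scores pile into one bin gets weight near 1; a rule whose scores spread uniformly gets weight near 0. Computing the weights needs only a forward pass over a calibration set, which is why no retraining is involved.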

Interactive · see how entropy weighting changes the aggregate
[Widget: aggregate reward · separation between preferred & dispreferred · initial separation 0.42]
Slide right to downweight high-entropy rules. The two distributions pull apart — that gap is what the reward model actually uses to rank.
Simulated Gaussian mixture · shape matches ablation in §4.2

Why this is theoretically grounded

Under Bradley–Terry, the optimal aggregation weights are inversely proportional to the variance of each rule's contribution to the margin. Rating entropy is a close proxy for that variance in the bounded-score regime safety models operate in. So ENCORE is not a heuristic; it is a first-order approximation to the Bayes-optimal aggregator. The approximation is cheap, the approximation is interpretable, and the approximation works.
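A quick numeric check of that claim, under a stated toy model rather than the paper's experiment: for bounded scores, inverse-variance weights (the Bradley–Terry-optimal direction) and 1 − H weights should rank the rules the same way.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Bounded rule scores with increasing spread: one crisp rule, then noisier ones.
spreads = [0.03, 0.08, 0.2, 0.35]
scores = np.clip(
    np.column_stack([rng.normal(0.6, s, n) for s in spreads]), 0.0, 1.0
)

# Inverse-variance weights, normalised.
inv_var = 1.0 / scores.var(axis=0)
inv_var /= inv_var.sum()

def one_minus_h(x, bins=10):
    """1 - normalised histogram entropy over [0, 1] (the ENCORE proxy)."""
    counts, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return 1.0 - (-(p * np.log(p)).sum() / np.log(bins))

ent_w = np.array([one_minus_h(scores[:, j]) for j in range(len(spreads))])
ent_w /= ent_w.sum()

# Both schemes put the most weight on the crispest rule and decrease from there.
```

The two weight vectors differ in scale but agree on the ordering, which is the first-order sense in which 1 − H tracks the inverse-variance optimum.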

Sometimes the paper that gets accepted is the one where you noticed a simple coefficient and argued carefully that it is the right one. That is this paper.
+4.5 pts on RewardBench-safety · 0 training required · interpretable per-rule weights

My contribution

Xiaomin drove the project. I contributed the Bradley–Terry derivation and the theoretical writeup. The empirical sweep was largely Jingxuan and Mingye. I think this paper will be useful because it proposes something an alignment team can apply to their existing reward models without touching a GPU — and that kind of drop-in fix is how alignment research actually reaches deployment.

I am a co-author with Xiaomin Li on "When Thinking Fails" (paper 14). Both papers share a methodological temperament: look at what a model is actually doing at the attention / score level, find a structural issue, and fix it with a lightweight intervention. I keep drifting toward that kind of work. It feels like the shape of a research taste forming.
