AAAI · Technical Track · 2026 · 8 min read

ENCORE: downweight the confused rules, keep the confident ones.

Multi-head safety reward models judge a response against a list of rules — "is it toxic?", "is it factual?", "is it harmful?". Some rules are ambiguous; their scores are noisy. We showed that rules with high rating entropy hurt accuracy, and the simple fix — entropy-weighted composition — beats the state of the art without any retraining.
Fig. 1 diagram: BEFORE, an equal-weight mean over rules A–D lets the high-entropy rules C and D make the aggregate noisy; AFTER, entropy weights (A 0.90, B 0.82, C 0.15, D 0.08) give a trustable weighted sum. The only change is the weights. No retraining. +4.5 pts on RewardBench-safety.
Fig. 1 — Rules with high rating entropy (noisy judgments) get downweighted. Confident rules dominate the aggregate.

Safety alignment uses reward models to score model outputs against a catalogue of rules. A single "is this safe?" head is coarse; the trend has been to split it into many heads — "is it toxic?", "is it harmful?", "is it factual?" — and aggregate. The natural aggregation is the mean, or a learned linear combination. We found that both of those waste information, and we found a simple fix.

The observation that started it

Looking at per-rule accuracy on RewardBench, we noticed a strong negative correlation between a rule's rating entropy (how dispersed its scores are over a fixed set of responses) and its downstream accuracy. Rules with crisp, low-entropy judgments were the ones that separated preferred from dispreferred responses; rules whose judgments were all over the place contributed roughly nothing.
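The diagnostic is easy to reproduce on synthetic data. The sketch below uses toy Gaussian rule scores (not the paper's data; `rating_entropy` is an assumed histogram estimator): rules with more score noise should show higher rating entropy and lower pairwise accuracy, so the two quantities should be negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs, n_rules = 500, 6

# Toy setup: each rule scores a preferred and a dispreferred response.
# A small true gap plus per-rule noise of increasing size; scores bounded in [0, 1].
noise = np.linspace(0.05, 0.6, n_rules)   # per-rule noise level
signal = 0.15                              # true preferred-minus-dispreferred gap
pref = np.clip(0.5 + signal / 2 + rng.normal(0, noise, (n_pairs, n_rules)), 0, 1)
disp = np.clip(0.5 - signal / 2 + rng.normal(0, noise, (n_pairs, n_rules)), 0, 1)

# Per-rule pairwise accuracy: how often the rule ranks the preferred response higher.
accuracy = (pref > disp).mean(axis=0)

def rating_entropy(x, bins=10):
    """Normalised entropy of a rule's score histogram over [0, 1]."""
    counts, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(bins)

entropy = np.array(
    [rating_entropy(np.r_[pref[:, j], disp[:, j]]) for j in range(n_rules)]
)

# The diagnostic: rating entropy and accuracy are negatively correlated.
corr = np.corrcoef(entropy, accuracy)[0, 1]
```

In this toy model the correlation comes out strongly negative, mirroring the pattern described above; the real analysis runs the same computation on actual reward-model heads.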

If noisy rules are noise, you should downweight them. The weight that falls out of a Bradley–Terry analysis of this setup is almost exactly 1 − H(rule), where H is the normalised entropy of the rule's scores over a calibration set. That is what ENCORE does.
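A minimal sketch of that weighting, assuming scores bounded in [0, 1] and a histogram entropy estimator (the paper's exact estimator and binning may differ):

```python
import numpy as np

def encore_weights(scores, n_bins=10):
    """Per-rule weights w_k = 1 - H_k, where H_k is the normalised
    entropy of rule k's scores over a calibration set.

    scores: (n_responses, n_rules) array of rule scores in [0, 1].
    """
    _, n_rules = scores.shape
    weights = np.empty(n_rules)
    for j in range(n_rules):
        # Histogram the rule's scores and take the entropy of the bin distribution.
        counts, _ = np.histogram(scores[:, j], bins=n_bins, range=(0.0, 1.0))
        p = counts / counts.sum()
        p = p[p > 0]
        h = -(p * np.log(p)).sum() / np.log(n_bins)  # normalised to [0, 1]
        weights[j] = 1.0 - h
    return weights

def aggregate(scores, weights):
    """Weighted aggregate reward per response (replaces the plain mean)."""
    return scores @ weights / weights.sum()
```

A crisp rule whose scores pile into one bin gets weight near 1; a rule whose scores spread uniformly gets weight near 0. Computing the weights needs only a forward pass over a calibration set, which is why no retraining is involved.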

Interactive · see how entropy weighting changes the aggregate
[Widget: aggregate reward · separation between preferred & dispreferred · initial separation 0.42]
Slide right to downweight high-entropy rules. The two distributions pull apart — that gap is what the reward model actually uses to rank.
Simulated Gaussian mixture · shape matches ablation in §4.2

Why this is theoretically grounded

Under Bradley–Terry, the optimal aggregation weights are inversely proportional to the variance of each rule's contribution to the margin. Rating entropy is a close proxy for that variance in the bounded-score regime safety models operate in. So ENCORE is not a heuristic; it is a first-order approximation to the Bayes-optimal aggregator. The approximation is cheap, the approximation is interpretable, and the approximation works.
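A quick numeric check of that claim, under a stated toy model rather than the paper's experiment: for bounded scores, inverse-variance weights (the Bradley–Terry-optimal direction) and 1 − H weights should rank the rules the same way.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Bounded rule scores with increasing spread: one crisp rule, then noisier ones.
spreads = [0.03, 0.08, 0.2, 0.35]
scores = np.clip(
    np.column_stack([rng.normal(0.6, s, n) for s in spreads]), 0.0, 1.0
)

# Inverse-variance weights, normalised.
inv_var = 1.0 / scores.var(axis=0)
inv_var /= inv_var.sum()

def one_minus_h(x, bins=10):
    """1 - normalised histogram entropy over [0, 1] (the ENCORE proxy)."""
    counts, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return 1.0 - (-(p * np.log(p)).sum() / np.log(bins))

ent_w = np.array([one_minus_h(scores[:, j]) for j in range(len(spreads))])
ent_w /= ent_w.sum()

# Both schemes put the most weight on the crispest rule and decrease from there.
```

The two weight vectors differ in scale but agree on the ordering, which is the first-order sense in which 1 − H tracks the inverse-variance optimum.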

Sometimes the paper that gets accepted is the one where you noticed a simple coefficient and argued carefully that it is the right one. That is this paper.
+4.5 pts on RewardBench-safety · 0 training required · interpretable per-rule weights

My contribution

Xiaomin drove the project. I contributed the Bradley–Terry derivation and the theoretical writeup. The empirical sweep was largely Jingxuan and Mingye. I think this paper will be useful because it proposes something an alignment team can apply to their existing reward models without touching a GPU — and that kind of drop-in fix is how alignment research actually reaches deployment.

I am a co-author with Xiaomin Li on "When Thinking Fails" (paper 14). Both papers share a methodological temperament: look at what a model is actually doing at the attention / score level, find a structural issue, and fix it with a lightweight intervention. I keep drifting toward that kind of work. It feels like the shape of a research taste forming.
