Chain-of-thought reasoning is one of the cleanest wins in the modern LLM toolkit. On math, on multi-step logic, on anything that benefits from explicit intermediate computation, it helps. So we assumed — because why wouldn't we — that it helps everywhere. This paper is the result of testing that assumption on a task it was never validated for: plain instruction-following. The finding: it doesn't help. It consistently hurts.
The effect is robust
We tested 15 models on IFEval and ComplexBench. Every single one drops in instruction-following accuracy when asked to reason first. The size of the drop varies — smaller for models that are good at reasoning, larger for ones where reasoning barely helps — but the sign is uniform. This is not a quirk of one lab's alignment recipe. It is a structural property of the "reason, then answer" pipeline.
The mechanism, measured
We introduced a metric called constraint attention: the fraction of the model's attention, during answer generation, that is directed at the instruction-relevant tokens of the prompt. With CoT off, constraint attention stays high — the model is still looking at the instructions while it writes. With CoT on, constraint attention drops, sometimes dramatically. The reasoning trace pulls attention toward its own intermediate tokens and away from the original constraints.
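The metric is straightforward to compute if you have the attention weights. Here is a minimal sketch, assuming a per-layer attention tensor of shape `(heads, seq_len, seq_len)` such as the one `output_attentions=True` returns in common transformer libraries; the function name and the exact averaging choices are illustrative, not the paper's verbatim definition:

```python
import numpy as np

def constraint_attention(attn, constraint_idx, answer_idx):
    """Fraction of attention mass that answer-position tokens direct
    at the instruction-relevant (constraint) tokens of the prompt.

    attn: array of shape (heads, seq_len, seq_len); each row is a
          probability distribution over source tokens (sums to 1).
    constraint_idx: indices of instruction-relevant prompt tokens.
    answer_idx: indices of the generated answer tokens.
    """
    rows = attn[:, answer_idx, :]               # (heads, |answer|, seq_len)
    mass = rows[:, :, constraint_idx].sum(-1)   # mass landing on constraints
    return float(mass.mean())                   # average over heads, positions
```

With CoT on, the reasoning-trace tokens join the candidate targets, so the same computation shows the constraint share shrinking as attention is pulled toward the trace.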
Where it hurts most
Not all constraints are created equal. Format constraints (respond in JSON, use exactly 3 bullets) suffer the most — they require the model to keep the spec in working memory while generating, which is exactly the capacity CoT chews up. Lexical constraints (include the word "ephemeral") also take a big hit. What survives unscathed is high-level content — the model usually still answers the question, just in a format you didn't ask for.
CoT is thinking out loud. Thinking out loud is great for proofs and terrible for "say less." Unfortunately we trained instruction-following models as if the two tasks were the same task.
The fix we liked best
Of the four mitigations we tried, the cleanest is classifier-selective reasoning: a tiny upstream classifier decides, per prompt, whether to engage CoT or answer directly. For math and multi-step reasoning it picks "think"; for instruction-following it picks "don't." It is not a subtle technique — it is essentially a routing layer — but it recovers almost all of the lost performance without changing the backbone model at all.
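The routing layer can be sketched in a few lines. Everything here is schematic: `needs_reasoning` stands in for the tiny upstream classifier (in practice a small trained model; a keyword heuristic works for illustration), and `generate` is the unchanged backbone:

```python
def route(prompt, needs_reasoning, generate):
    """Classifier-selective reasoning: decide per prompt whether to
    engage CoT or answer directly, leaving the backbone untouched.

    needs_reasoning: callable prompt -> bool (the tiny classifier).
    generate: callable prompt -> str (the backbone model).
    """
    if needs_reasoning(prompt):
        # Math / multi-step reasoning: prepend a think-first directive.
        return generate("Think step by step, then answer.\n" + prompt)
    # Instruction-following: answer directly, keep constraint attention high.
    return generate(prompt)
```

The design point is that the decision is made before generation, so the instruction-following path never pays the attention cost of a reasoning trace it didn't need.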
I joined this paper as a mid-stage collaborator to help with the attention-analysis methodology. Most of the ideation was Xiaomin's; most of the writing was Zhou's. My role was the measurement. I am proud of the constraint-attention metric — it is simple and interpretable and, in hindsight, it is one of those metrics that should probably have existed five years ago.