Report #55476
[counterintuitive] Why does rephrasing the same instruction produce wildly different results even though the task is identical?
Systematically evaluate multiple prompt phrasings for critical tasks. Never assume a single prompt formulation represents the model's capability ceiling. Use prompt optimization tools or A/B testing rather than trusting the first working formulation.
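A minimal sketch of the A/B comparison recommended above, in Python. It scores each phrasing on the same labelled examples and reports the gap between the best and worst phrasing. Every name here (`evaluate_phrasings`, `accuracy_spread`, `toy_model`) is hypothetical. `toy_model` is a deliberately brittle stand-in for a real model call, not any real API, and exact string matching is only one possible scoring rule.

```python
import re
from typing import Callable, Dict, List, Tuple


def evaluate_phrasings(
    model: Callable[[str], str],
    templates: List[str],
    dataset: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return exact-match accuracy for each template.

    Each template must contain a '{question}' placeholder.
    """
    scores = {}
    for template in templates:
        hits = 0
        for question, expected in dataset:
            answer = model(template.format(question=question))
            hits += int(answer.strip() == expected)
        scores[template] = hits / len(dataset)
    return scores


def accuracy_spread(scores: Dict[str, float]) -> float:
    """Gap between the best and worst phrasing on the same task."""
    return max(scores.values()) - min(scores.values())


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: it only answers an addition
    question when the prompt ends with 'Answer:'."""
    match = re.search(r"(\d+)\s*\+\s*(\d+)", prompt)
    if match and prompt.rstrip().endswith("Answer:"):
        return str(int(match.group(1)) + int(match.group(2)))
    return "unsure"


templates = [
    "What is {question}? Answer:",
    "Please compute {question}.",
]
dataset = [("2+2", "4"), ("3+5", "8")]

scores = evaluate_phrasings(toy_model, templates, dataset)
spread = accuracy_spread(scores)
```

With a real model in place of `toy_model`, a large `spread` would suggest that any single template is a poor proxy for what the model can do on the task.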
Journey Context:
The widespread belief is that if the model can do X, it will do X when asked: capability is a property of the model, not the prompt. In reality, there is a large gap between capability (what the model can do under optimal prompting) and reliability (what it consistently does across phrasings). Research shows that trivial changes (synonym substitution, example reordering, formatting changes, even added whitespace) can shift accuracy by 10-60 percentage points on the same underlying task. This isn't the model 'misunderstanding'. The model's behavior is a function of the entire context, and small changes shift the attention distribution enough to activate different internal computation paths. A model that solves a problem in one phrasing may fail it in another not because it forgot how, but because the different phrasing routes through a different region of the model's learned function. This means single-prompt evaluation can dramatically over- or under-estimate model capability.
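The surface changes listed above can be generated mechanically so that each one is tested in isolation. The sketch below, in Python, builds one variant per perturbation type from a single few-shot prompt. All function names are hypothetical, the synonym table is supplied by the caller, and reversing the examples is just one simple, deterministic way to reorder them.

```python
import re
from typing import Dict, List


def substitute_synonyms(text: str, synonyms: Dict[str, str]) -> str:
    """Replace whole words according to a caller-supplied synonym table."""
    for word, replacement in synonyms.items():
        pattern = rf"\b{re.escape(word)}\b"
        # A function replacement avoids backslash interpretation in the new word.
        text = re.sub(pattern, lambda _m, r=replacement: r, text)
    return text


def render(instruction: str, examples: List[str]) -> str:
    """Join the instruction and few-shot examples, one per line."""
    return instruction + "\n" + "\n".join(examples)


def build_variants(
    instruction: str,
    examples: List[str],
    synonyms: Dict[str, str],
) -> Dict[str, str]:
    """Return one prompt per perturbation type, plus the baseline."""
    baseline = render(instruction, examples)
    return {
        "baseline": baseline,
        # Same meaning, different wording of the instruction.
        "synonyms": render(substitute_synonyms(instruction, synonyms), examples),
        # Same examples, reversed order.
        "reordered": render(instruction, list(reversed(examples))),
        # Same text, extra trailing blank lines.
        "whitespace": baseline + "\n\n",
    }


variants = build_variants(
    "Classify the sentiment of the review.",
    ["great -> positive", "awful -> negative"],
    {"Classify": "Label"},
)
```

Each value in `variants` asks for the same task, so any accuracy difference between them on a fixed test set reflects sensitivity to phrasing rather than a change in difficulty.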
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:36:33.501684+00:00— report_created — created