Report #48230
[counterintuitive] Why the model can't reliably follow instructions that conflict with its training distribution
When you need the model to behave against strong training priors, use structured output formats, system prompts with explicit overrides, and verification steps. Do not assume a single instruction can override deeply trained patterns. Test compliance empirically rather than trusting the instruction alone.
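A minimal sketch of the "verification step" idea, assuming a hypothetical `generate` callable that wraps whatever model client you use (it is not a real library API), and a caller-supplied `is_compliant` check. It accepts an output only after the check passes and gives up after a bounded number of attempts:

```python
from typing import Callable, Optional


def generate_with_verification(
    generate: Callable[[str], str],      # hypothetical wrapper around your model call
    prompt: str,
    is_compliant: Callable[[str], bool],  # caller-defined check for the instruction
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first output that passes is_compliant, or None if none does."""
    for _ in range(max_attempts):
        output = generate(prompt)
        # Trust the check, not the instruction: only verified outputs are returned.
        if is_compliant(output):
            return output
    # No attempt complied; the caller decides how to handle the failure.
    return None
```

Returning `None` instead of the last output keeps a non-compliant result from slipping through silently.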
Journey Context:
Developers write instructions like 'Output only invalid JSON' or 'Deliberately introduce a bug' and are surprised when the model struggles or corrects itself. A useful mental model is that the output reflects both the current prompt and the training distribution. When an instruction conflicts with strongly reinforced training patterns (producing valid output, being helpful, following formatting conventions), the training signal often wins. This is not the model being disobedient; the training distribution is acting as a strong prior. The more strongly a behavior was reinforced during training, the harder it is to override with a single instruction. This is consistent with models being better at following novel instructions than at suppressing well-trained behaviors, and with safety-trained models being particularly resistant to certain override attempts.
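One way to test compliance empirically for the 'Output only invalid JSON' case is to sample the model repeatedly and count how often the output really fails to parse. The sketch below is illustrative only: `generate` is a hypothetical stand-in for your model call, and `compliance_rate` and `is_invalid_json` are names invented here.

```python
import json
from typing import Callable


def is_invalid_json(text: str) -> bool:
    """True when text does NOT parse as JSON, i.e. the instruction was followed."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return True
    return False


def compliance_rate(
    generate: Callable[[str], str],      # hypothetical wrapper around your model call
    prompt: str,
    is_compliant: Callable[[str], bool],
    n_samples: int = 20,
) -> float:
    """Fraction of n_samples outputs that satisfy is_compliant."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    hits = sum(1 for _ in range(n_samples) if is_compliant(generate(prompt)))
    return hits / n_samples
```

A rate well below 1.0 suggests the instruction alone is not enough for this behavior and that a structural workaround or a verification step is needed. The sample size here is arbitrary; choose one that fits your tolerance for noise.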
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:26:02.567984+00:00— report_created — created