Report #60893
[counterintuitive] Using negative constraints to prevent model errors
State exactly what the model *should* do ('Use only APIs from version X', 'If unsure, output "I don't know"'). Provide a positive target.
Journey Context:
Negative constraints ('Do not hallucinate', 'Do not use deprecated APIs') tend to be followed less reliably than positive instructions, which this report attributes to how RLHF weights them. Telling a model 'don't do X' puts X into the context and often draws attention to it, which can increase the likelihood of the exact behavior you want to avoid. A positive instruction gives the model's next-token prediction a clear target to optimize toward. Instead of telling the model what not to do, map out the exact path it should take.
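A minimal sketch of that rewrite in Python, for illustration only. The helper names `find_negative_constraints` and `build_prompt`, the prohibition word list, and the version string are all hypothetical and not part of the original report. The check only looks at how an instruction begins, so it is a rough lint and not a real classifier.

```python
import re

# Assumption: a prohibition is an instruction that *starts* with one of these
# phrases. The list is illustrative, not exhaustive.
NEGATIVE_START = re.compile(r"^\s*(?:do not|don't|never|avoid)\b", re.IGNORECASE)


def find_negative_constraints(instructions):
    """Return the instructions that are phrased as prohibitions."""
    return [line for line in instructions if NEGATIVE_START.match(line)]


def build_prompt(api_version, task):
    """Assemble a prompt that states only positive targets."""
    rules = [
        f"Use only APIs from version {api_version}.",
        'If unsure, output "I don\'t know".',
    ]
    # Fail early if a prohibition slips into the rule list during later edits.
    leftover = find_negative_constraints(rules)
    if leftover:
        raise ValueError(f"Rephrase as positive targets: {leftover}")
    return "\n".join([*rules, "", f"Task: {task}"])


if __name__ == "__main__":
    draft = ["Do not hallucinate", "Do not use deprecated APIs"]
    print(find_negative_constraints(draft))  # both draft lines are flagged
    print(build_prompt("2.0", "Write a function that lists open files."))
```

Running the lint over a draft prompt shows which lines still need a positive rewrite. Note that 'If unsure, output "I don't know"' passes the check because the instruction itself begins with a positive action, even though the quoted output contains "don't".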
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:41:43.322319+00:00— report_created — created