Report #55510
[counterintuitive] Adding instructions like "Do not hallucinate" or "Do not make mistakes"
Provide positive constraints and verification steps: "Cross-reference API signatures with the provided documentation," or "Write a unit test that validates the output."
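As a minimal sketch of this rewording, the negative instructions can be replaced with explicit positive constraints appended to the task. The `build_prompt` helper, the constant name, and the sample task and documentation strings are hypothetical and only illustrate the idea; they do not come from any library.

```python
# Hypothetical helper for illustration; these names are assumptions, not a library API.
POSITIVE_CONSTRAINTS = [
    "Cross-reference API signatures with the provided documentation.",
    "Write a unit test that validates the output.",
]


def build_prompt(task: str, documentation: str) -> str:
    """Assemble a prompt that states what to do and how to verify it."""
    constraints = "\n".join(f"- {c}" for c in POSITIVE_CONSTRAINTS)
    return (
        f"Task:\n{task}\n\n"
        f"Documentation:\n{documentation}\n\n"
        f"Constraints:\n{constraints}\n"
    )


# Sample inputs are made up for the example.
prompt = build_prompt(
    task="Implement parse_date(text) using the documented helper.",
    documentation="to_iso(text: str) -> str",
)
print(prompt)
```

The prompt names the reference material and the check to perform, and never mentions the failure mode itself.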
Journey Context:
Negative constraints in RLHF'd models often backfire. The attention mechanism attends to the tokens "hallucinate" or "mistakes," which can paradoxically increase their likelihood. Models have no internal "truthfulness" flag; they only predict the next token. Positive constraints (like "use the provided context"), or forcing the model to generate verification code (a test-driven approach), shift the probability distribution toward correct outputs.
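The test-driven variant can be sketched as a small harness that runs the model's own unit test against the model's own code and accepts the output only if the test passes. Everything here is an assumption for illustration: `passes_verification` is a hypothetical helper, and the two source strings stand in for real model output.

```python
import unittest

# Stand-in for generated code; in practice this string would come from the model.
GENERATED_SOURCE = '''
def slugify(title):
    return "-".join(title.lower().split())
'''

# Stand-in for the unit test the model was asked to write alongside the code.
GENERATED_TEST = '''
import unittest

class SlugifyTest(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(slugify("Hello World"), "hello-world")
'''


def passes_verification(source: str, test_source: str) -> bool:
    """Return True only if at least one test ran and every test passed."""
    # exec runs arbitrary code: use it only on output you are prepared to sandbox.
    namespace: dict = {}
    exec(source, namespace)       # defines the function under test
    exec(test_source, namespace)  # defines TestCase classes that call it
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        if isinstance(obj, type) and issubclass(obj, unittest.TestCase):
            suite.addTests(loader.loadTestsFromTestCase(obj))
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun > 0 and result.wasSuccessful()


print(passes_verification(GENERATED_SOURCE, GENERATED_TEST))
```

Requiring `testsRun > 0` keeps an empty or missing test from counting as a pass. A passing test still only shows that the code agrees with its own test, so the result should be treated as a filter rather than proof of correctness.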
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:40:04.728658+00:00— report_created — created