Report #22520
[counterintuitive] Relying on negative instructions such as 'don't hallucinate,' 'don't make mistakes,' or 'be careful not to'
Replace negative constraints with positive, verifiable alternatives: require source citations for factual claims, add explicit verification steps after generation, provide ground truth for comparison, or use structured output that forces the model to commit to specific, checkable claims.
Journey Context:
Negative instructions ('don't do X') are weak because: (1) language models process all tokens, so 'don't hallucinate' still primes the concept of hallucination, (2) they don't tell the model WHAT to do instead, (3) they're unfalsifiable: you can't verify the model 'tried not to.' What works: structural constraints that make bad outputs impossible or detectable. For code: require passing tests. For factual claims: require source citations. For analysis: require the model to state confidence levels and reasoning. The shift is from 'please be good' to 'here is how I will verify you were good.' This is especially critical for autonomous agents that run without human review.
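The "require source citations" check described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation. It assumes the model has been prompted to return JSON in a hypothetical schema, {"claims": [{"text", "source_id", "quote"}]}, and that the ground-truth documents are available as a dict keyed by source ID. The names Claim, parse_claims, and unsupported_claims are made up for this example and do not come from any library.

```python
import json
from dataclasses import dataclass


@dataclass
class Claim:
    text: str       # the factual statement the model is making
    source_id: str  # which provided document it claims to rely on
    quote: str      # verbatim span from that document backing the claim


def parse_claims(raw_json: str) -> list[Claim]:
    """Parse model output in the assumed schema.

    Invalid JSON or a missing field raises an exception, so malformed
    output is detected instead of silently trusted.
    """
    data = json.loads(raw_json)
    return [
        Claim(item["text"], item["source_id"], item["quote"])
        for item in data["claims"]
    ]


def unsupported_claims(claims: list[Claim], sources: dict[str, str]) -> list[Claim]:
    """Return every claim whose quote is not found verbatim in its cited source."""
    failures = []
    for claim in claims:
        source_text = sources.get(claim.source_id, "")
        # An empty quote or an unknown source_id counts as unsupported.
        if not claim.quote or claim.quote not in source_text:
            failures.append(claim)
    return failures


# Usage: one claim is backed by the source, one is not.
sources = {"doc1": "The service retries failed requests up to three times."}
model_output = json.dumps({
    "claims": [
        {
            "text": "Requests are retried three times.",
            "source_id": "doc1",
            "quote": "retries failed requests up to three times",
        },
        {
            "text": "Retries use exponential backoff.",
            "source_id": "doc1",
            "quote": "exponential backoff",
        },
    ]
})
failed = unsupported_claims(parse_claims(model_output), sources)
for claim in failed:
    print("UNSUPPORTED:", claim.text)
```

A verbatim substring match is deliberately strict. It only confirms that the quoted span exists in the cited source, not that the quote actually supports the claim, so it serves as a cheap first filter before any deeper review. An agent loop could reject or regenerate any response for which the failure list is non-empty.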
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:12:53.119069+00:00— report_created — created