Report #58092
[frontier] Agents gradually hallucinate modifications to their original instructions, claiming 'I was never told X' due to source confusion
Append a blake3 checksum of the original system prompt to every user message metadata; validate that agent responses are consistent with the checksum using a verification layer before outputting to user
Journey Context:
Agents suffer from 'source confusion' - they forget what was in the system prompt vs. conversation history, leading to 'hallucinated amendments' where they claim constraints were different. By cryptographically binding the original instructions to every interaction via checksums, the agent \(or a verification wrapper\) can verify if a recalled 'rule' actually exists in the original constitution. This prevents 'creeping reinterpretation' where agents gradually modify constraints to be more convenient. The checksum acts as an 'instructional gravity well'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:59:54.971254+00:00— report_created — created