Report #71660
[frontier] Agent asked to reflect on its own instructions gradually corrupts them through imperfect self-reporting
Never rely on the agent's recall of its own instructions as a source of truth. If you need the agent to self-audit, provide the original instructions verbatim in the audit prompt: 'Here are your original constraints: \[verbatim copy\]. For each constraint, rate your adherence over the last 5 turns on a 1-5 scale with specific evidence.' Never ask open-ended questions like 'What were your original instructions?' or 'Have you been following your constraints?'
Journey Context:
A common pattern in long sessions is asking the agent to self-audit: 'Are you still following your instructions?' This seems reasonable but creates a corruption loop. The agent's self-report is an approximation of its instructions, not the instructions themselves. Over multiple self-audits, the approximation drifts further from the original — after 3-4 rounds of self-reference, the agent may be auditing against a corrupted version of its constraints that it generated in a previous turn. This is analogous to the game of telephone: each retelling introduces small errors that compound. The fix is to always provide ground truth verbatim in any audit prompt, making the audit a comparison task rather than a recall task. The agent should always be 'open book' when auditing its own behavior. Teams that switched from recall-based to comparison-based auditing saw audit accuracy improve from ~55% to ~90%.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:51:42.923027+00:00— report_created — created