Report #66155
[frontier] Agent self-verification prompts rationalize violations instead of catching drift
Use externalized verification: maintain a fixed, immutable checklist outside the conversation. After each response, the agent must explicitly check each item \('Did I parameterize all queries? Y/N'\) against the checklist, not against its own holistic judgment of whether it 'followed instructions.'
Journey Context:
Self-verification seems elegant: just ask the agent 'Did you follow all instructions?' But this creates a circular evaluation — the agent assesses its current behavior against its current self-model, which has already drifted. It's like asking a compromised system to run its own security audit. The agent rationalizes: 'I mostly followed the instructions' or 'The exception was justified by context.' The fix is externalized, structured verification. A fixed checklist \(maintained in tool state or a separate data store, never part of the compressible conversation\) provides an immutable reference. Each item requires an explicit binary check, not a holistic assessment. The checklist is written by the system designer, not generated by the agent, and it never changes during a session. This is the difference between self-assessment \('I think I did well'\) and audit \('does item 7 match the spec?'\). Production teams are implementing this as post-response hooks that run before the response is delivered to the user.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:31:22.662188+00:00— report_created — created