Report #51850
[frontier] Metacognitive Blindness to Drift: Agent fails to recognize its own instruction drift because metacognitive self-monitoring degrades at the same rate as task performance
Implement an 'Externalized Cognitive Audit': every 20 turns, force a tool call to a separate 'Auditor' agent that receives (a) the original system prompt, (b) the last 5 turns, and (c) a request to diff the recent behavior against that prompt. The Auditor returns a drift score; if the score exceeds 0.3, trigger a context reset.
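A minimal sketch of the audit loop described above, assuming the Auditor can be wrapped as a plain function that returns a score between 0 and 1. The names `AuditedSession`, `record_turn`, and `keyword_auditor` are hypothetical and not part of any library; the toy auditor stands in for a real call to a separate model. The interval, window, and threshold are the values given in the report.

```python
from dataclasses import dataclass, field
from typing import Callable, List

AUDIT_INTERVAL = 20    # turns between audits (from the report)
RECENT_TURNS = 5       # turns sent to the Auditor (from the report)
DRIFT_THRESHOLD = 0.3  # reset when the score exceeds this (from the report)

# Hypothetical auditor signature: (system_prompt, recent_turns) -> drift score in [0, 1]
Auditor = Callable[[str, List[str]], float]


@dataclass
class AuditedSession:
    system_prompt: str
    auditor: Auditor
    turns: List[str] = field(default_factory=list)
    resets: int = 0

    def record_turn(self, text: str) -> bool:
        """Append a turn; return True if this turn triggered a context reset."""
        self.turns.append(text)
        if len(self.turns) % AUDIT_INTERVAL != 0:
            return False
        # The Auditor sees only the original prompt and a short recent window,
        # never the full accumulated context.
        score = self.auditor(self.system_prompt, self.turns[-RECENT_TURNS:])
        if score > DRIFT_THRESHOLD:
            self.turns.clear()  # context reset: drop accumulated history
            self.resets += 1
            return True
        return False


def keyword_auditor(system_prompt: str, recent: List[str]) -> float:
    """Toy stand-in for a model call: fraction of recent turns missing 'json'."""
    if not recent:
        return 0.0
    missing = sum(1 for t in recent if "json" not in t.lower())
    return missing / len(recent)
```

In a real deployment the auditor function would call a separate model with a fresh context. What a reset restores (for example, re-sending the original system prompt) is left open here because the report does not specify it.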
Journey Context:
Self-correction mechanisms fail in long sessions because the reference frame for 'correctness' drifts along with the agent's behavior. Asking 'Am I following instructions?' relies on the current degraded state to judge itself. An external auditor provides a fixed reference point (the original prompt) and fresh attention (no accumulated context). Tradeoff: latency and cost; mitigate by using a cheaper, faster model for the auditor and by auditing on a sampled basis rather than at every 20-turn checkpoint.
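One possible reading of the sampling mitigation, sketched under assumptions: each turn is selected for an audit with a fixed probability instead of on a fixed schedule. The function names and the `sample_rate` parameter are hypothetical, and the report does not say which sampling scheme was intended.

```python
import random


def should_audit(rng: random.Random, sample_rate: float) -> bool:
    """Return True when this turn is selected for an audit."""
    # rng.random() is in [0.0, 1.0), so a rate of 0.0 never audits
    # and a rate of 1.0 always audits.
    return rng.random() < sample_rate


def count_audits(total_turns: int, sample_rate: float, seed: int = 0) -> int:
    """Count how many audits a session of total_turns would trigger."""
    rng = random.Random(seed)  # seeded so a run is reproducible
    return sum(should_audit(rng, sample_rate) for _ in range(total_turns))
```

A lower `sample_rate` reduces auditor calls and therefore cost, at the price of a longer expected delay before drift is detected.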
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:31:26.087344+00:00— report_created — created