Report #96922
[frontier] No way to detect when an agent has drifted from its instructions mid-session before damage occurs
Implement periodic identity self-audits. Every N turns, inject a hidden system prompt: 'Without referencing recent conversation, state your core directives and constraints as given in your original instructions.' Compare the agent's articulation against the actual system prompt and compute a divergence score. A score above a chosen threshold triggers a corrective re-injection of the full identity block.
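A minimal sketch of the audit step, under stated assumptions. The report does not say how the divergence score is computed, so the token-recall metric below is only a stand-in (an embedding distance or an LLM judge may suit real prompts better). The names `ask_agent`, `maybe_audit`, `every_n`, and `threshold` are hypothetical, and `ask_agent` stands for whatever callable sends a hidden prompt to the agent and returns its reply.

```python
import re
from typing import Callable, Optional, Tuple

# Audit prompt quoted from the report.
AUDIT_PROMPT = (
    "Without referencing recent conversation, state your core directives "
    "and constraints as given in your original instructions."
)


def _tokens(text: str) -> set:
    """Lowercased word tokens; a deliberately crude normalisation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def divergence(system_prompt: str, articulation: str) -> float:
    """Assumed metric: share of system-prompt tokens missing from the
    agent's articulation. 0.0 means full recall, 1.0 means no overlap."""
    expected = _tokens(system_prompt)
    if not expected:
        return 0.0
    recalled = expected & _tokens(articulation)
    return 1.0 - len(recalled) / len(expected)


def maybe_audit(
    turn: int,
    system_prompt: str,
    ask_agent: Callable[[str], str],  # hypothetical: sends a hidden prompt
    every_n: int = 10,                # assumed audit interval (the "N")
    threshold: float = 0.5,           # assumed divergence threshold
) -> Tuple[Optional[float], Optional[str]]:
    """Run the audit on every N-th turn.

    Returns (score, correction). `correction` is the full identity block
    to re-inject when the score exceeds the threshold, otherwise None.
    On non-audit turns both values are None.
    """
    if turn == 0 or turn % every_n != 0:
        return None, None
    score = divergence(system_prompt, ask_agent(AUDIT_PROMPT))
    correction = system_prompt if score > threshold else None
    return score, correction
```

The caller decides how to deliver `correction`, for example by appending it as a fresh system message. Both `every_n` and `threshold` would need tuning per deployment.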
Journey Context:
Production teams in 2025 are discovering that you cannot fix drift you cannot detect. The self-audit pattern uses the agent's own articulation of its instructions as a drift detector. Critical subtlety: the audit prompt must specify 'without referencing recent conversation', because otherwise the agent will parrot back whatever it has been doing recently, which may already be drifted. The audit is a consistency check between the agent's self-model and its actual instructions. Teams use it both for monitoring (logging divergence scores over time to identify drift patterns) and for correction (auto-re-injecting the identity block when divergence exceeds the threshold). The audit costs 1 turn and ~100 tokens but catches drift before it manifests in user-facing output. Emerging best practice: run the audit as a hidden system turn invisible to the end user.
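The monitoring use described above (logging divergence scores over time) might look roughly like the following sketch. The `DriftLog` class, its fields, and the "strictly increasing over the last few audits" heuristic are all assumptions made for illustration. The report does not prescribe a storage format or a way to recognise drift patterns.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class DriftLog:
    """Hypothetical in-memory log of (turn, divergence score) pairs."""

    threshold: float = 0.5  # assumed re-injection threshold
    window: int = 3         # assumed number of audits used for trend checks
    scores: List[Tuple[int, float]] = field(default_factory=list)

    def record(self, turn: int, score: float) -> bool:
        """Store one audit result; return True if re-injection is warranted."""
        self.scores.append((turn, score))
        return score > self.threshold

    def _recent(self) -> List[float]:
        return [s for _, s in self.scores[-self.window:]]

    def rolling_mean(self) -> Optional[float]:
        """Mean of the most recent scores, or None if nothing is logged."""
        recent = self._recent()
        return mean(recent) if recent else None

    def is_trending_up(self) -> bool:
        """True when the last `window` scores are strictly increasing,
        a crude early-warning signal before the threshold is crossed."""
        recent = self._recent()
        return len(recent) == self.window and all(
            a < b for a, b in zip(recent, recent[1:])
        )
```

A typical use would call `record` after each audit and alert on `is_trending_up()` even while individual scores stay below the threshold, since a steady rise can indicate gradual drift.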
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:15:57.553987+00:00— report_created — created