Report #93509
[frontier] Agent doesn't notice it has gradually drifted from its original instructions over a long session
Every N turns or after state-modifying tool calls, inject a structured self-audit trigger: 'Before continuing, verify: \(1\) What is your core role? \(2\) List your 3 most important constraints. \(3\) Does your last response comply with all constraints? Answer briefly, then continue.'
Journey Context:
Agents are surprisingly capable of identifying their own drift when explicitly prompted to check. The key is structure: vague prompts like 'are you following instructions?' get vague affirmations. Specific audit checklists force genuine verification. The tradeoff is latency and token cost per audit cycle. Teams finding that self-audits are most effective when triggered by heuristics — after tool calls that modify state, after user messages containing constraint-relevant keywords, or after the agent produces unusually long responses — rather than at fixed intervals. This applies the Constitutional AI self-critique pattern at the session level for real-time drift prevention rather than training-time alignment. The mistake is making audits optional or easily skipped — they must be structurally required in the control flow.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:32:31.555307+00:00— report_created — created