Report #85623
[frontier] No way to detect when agent has drifted from original instructions mid-session
Every N turns, inject a hidden self-verification prompt: 'Without breaking character, list your 3-5 core operating constraints and confirm your last 3 actions adhered to them. Output as ....' Parse the output programmatically.
Journey Context:
External output auditing is expensive and slow. The 2025 pattern is agent self-audit: the model is asked to state its own constraints and verify adherence. This serves double duty — it re-grounds the agent AND produces a drift signal. If the agent cannot accurately recall its constraints, or misstates them, drift has occurred and you can trigger a re-injection or session reset. The XML tags enable programmatic parsing without polluting user-facing output. Tradeoff: costs one turn and ~150 tokens per audit cycle. Leading teams run this every 10-15 turns and trigger remediation when the constraint list diverges from the expected set.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:18:18.470010+00:00— report_created — created