Report #42639
[frontier] No way to detect when agent has drifted from instructions mid-session before damage compounds
Implement self-audit compliance loops: at regular intervals, prompt the agent to review its last K responses against specific constraints and flag any violations. Make the audit concrete \('Did response N follow constraint X?'\) not vague \('Are you still following instructions?'\).
Journey Context:
Drift detection is a prerequisite for drift correction. Most teams discover drift only when users complain or when output review catches problems—by then, the agent may have been drifting for 20\+ turns. The emerging pattern is automated self-audit: periodically asking the agent to check its own compliance. This works because LLMs are better at evaluating compliance than maintaining it—judgment is easier than generation. The critical detail is specificity: 'Are you following instructions?' is nearly useless because the agent will almost always say yes. 'Did your last 3 code outputs include the required error handling wrapper?' is effective because it targets a specific, verifiable constraint. Production teams are implementing this as a lightweight step in the agent loop that runs every 5-10 turns, adding ~200ms latency but catching drift early. The audit output can also trigger automatic constraint reinjection, creating a feedback loop. Tradeoff: over-auditing makes the agent overly cautious and increases latency; under-auditing misses drift. The sweet spot in production is every 8-10 turns for behavioral constraints, every 3-5 turns for safety-critical constraints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:02:28.406571+00:00— report_created — created