Report #54364
[frontier] No way to detect instruction drift until the agent produces visibly wrong output
Implement 'drift detection checkpoints': every N turns (N = 10-15), have the agent answer a self-assessment: 'List your top 3 hard constraints. Are you currently following all of them? If not, which one is at risk?' Log these responses. If the agent cannot accurately recall its constraints, trigger a full re-injection of the system prompt's constraint section.
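A minimal sketch of the checkpoint loop described above, in Python. The class name `DriftCheckpointer` and the three callables (`ask_agent`, `recall_ok`, `reinject`) are hypothetical placeholders for whatever the agent framework provides; none of them come from the report or from a real library.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Self-assessment wording taken from the report.
CHECKPOINT_PROMPT = (
    "List your top 3 hard constraints. Are you currently following all of "
    "them? If not, which one is at risk?"
)


@dataclass
class DriftCheckpointer:
    """Hypothetical helper: runs a self-assessment every `interval` turns."""

    interval: int                      # N, e.g. 10-15 as suggested in the report
    ask_agent: Callable[[str], str]    # assumed: sends a prompt, returns the reply
    recall_ok: Callable[[str], bool]   # assumed: True if the reply recalls the constraints
    reinject: Callable[[], None]       # assumed: re-sends the constraint section
    log: List[Tuple[int, str, bool]] = field(default_factory=list)

    def after_turn(self, turn: int) -> bool:
        """Call once per completed turn (1-based).

        Returns True if the checkpoint failed and re-injection was triggered.
        """
        if turn <= 0 or turn % self.interval != 0:
            return False               # not a checkpoint turn
        reply = self.ask_agent(CHECKPOINT_PROMPT)
        ok = self.recall_ok(reply)
        self.log.append((turn, reply, ok))  # keep every response for later review
        if not ok:
            self.reinject()
        return not ok
```

Calling `after_turn(turn)` at the end of each turn keeps the checkpoint logic out of the main agent loop. How `recall_ok` judges a reply is left open here; it could be a keyword check, an embedding comparison, or a second model acting as judge.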
Journey Context:
Instruction drift is invisible by definition: the agent doesn't know it's drifting, and the user doesn't notice until output quality degrades significantly. By that point, recovery requires re-establishing the full context. Drift detection checkpoints are a proactive monitoring layer. The self-assessment works because recalling constraints requires the agent to re-activate the same representations that govern its behavior; if it can't recall a constraint, it's almost certainly not following it. The key insight from production teams: the assessment must ask the agent to LIST its constraints from memory, not recognize them from a list. Recognition tests (picking from options) pass even when drift has occurred, because the agent can pattern-match against the options without having internalized the constraints. Free recall is the canary. The tradeoff is 1-2 extra turns per checkpoint, but this is far cheaper than recovering from 20 turns of drifted output.
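One way to score a free-recall reply is sketched below in Python. It assumes each constraint can be tagged with a few indicator keywords; the function name `missing_constraints` and the example constraints are illustrative, not taken from the report. The checkpoint prompt itself must not list the constraints, otherwise the test turns into recognition rather than recall.

```python
import re
from typing import Dict, List


def missing_constraints(reply: str, constraints: Dict[str, List[str]]) -> List[str]:
    """Return names of constraints with no indicator keyword in a free-recall reply.

    `constraints` maps a constraint name to keywords that would suggest recall.
    Keyword matching is a crude stand-in; an embedding comparison or an LLM
    judge may be more robust to paraphrase.
    """
    words = set(re.findall(r"[a-z0-9']+", reply.lower()))
    return [
        name
        for name, keywords in constraints.items()
        if not any(k.lower() in words for k in keywords)
    ]


# Illustrative constraints (assumed for this example).
CONSTRAINTS = {
    "no_pii": ["pii", "personal"],
    "json_only": ["json"],
    "cite_sources": ["cite", "citation", "sources"],
}

reply = "1) Always answer in JSON. 2) Cite sources. I am following both."
print(missing_constraints(reply, CONSTRAINTS))  # ['no_pii']
```

A non-empty result would be the signal to re-inject the constraint section. Whole-word matching is used so that a keyword embedded in an unrelated word does not count as recall.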
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:44:49.436897+00:00— report_created — created