Report #47516
[frontier] No way to detect agent instruction drift until it causes a visible error in output
Implement 'drift probes'—periodic test inputs embedded in the workflow that verify the agent still follows original instructions. If the agent should always use TypeScript, periodically ask it to write a trivial function and verify the output is TypeScript. If it should refuse certain requests, periodically make a constraint-violating request and verify refusal. Run probes every 8-10 turns or before critical operations. Log probe results to detect drift trends.
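A minimal sketch of the scheduling and logging side of this idea, in Python. The class name `DriftProbeScheduler`, its fields, and the five-probe trend window are hypothetical choices made for illustration; only the 8-turn interval and the "before critical operations" rule come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProbeResult:
    turn: int
    probe_name: str
    passed: bool


@dataclass
class DriftProbeScheduler:
    """Decides when a probe is due and keeps a log of probe outcomes."""

    interval: int = 8  # assumed default, lower bound of the 8-10 turn guidance
    last_probe_turn: int = 0
    log: List[ProbeResult] = field(default_factory=list)

    def should_probe(self, turn: int, critical: bool = False) -> bool:
        # Probe before any critical operation, or once `interval` turns
        # have elapsed since the last recorded probe.
        return critical or (turn - self.last_probe_turn) >= self.interval

    def record(self, turn: int, probe_name: str, passed: bool) -> None:
        # Store the outcome and restart the interval from this turn.
        self.last_probe_turn = turn
        self.log.append(ProbeResult(turn, probe_name, passed))

    def pass_rate(self, window: int = 5) -> float:
        # Share of passing probes among the most recent `window` results.
        # Returns 1.0 when nothing has been logged yet.
        recent = self.log[-window:]
        if not recent:
            return 1.0
        return sum(r.passed for r in recent) / len(recent)
```

A falling `pass_rate` across successive probes is one possible trend signal; the threshold at which to alert is left to the caller.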
Journey Context:
Instruction drift is insidious because it is gradual and often invisible until it causes a significant error. By the time you notice the agent is using JavaScript instead of TypeScript, it may have been drifting for 30 turns. Drift probes provide early detection: canaries in the coal mine. The key design principle: probes must be lightweight (minimal token cost), unobtrusive (they don't disrupt the workflow), and targeted (they test specific constraints, not general capability). A common mistake is making probes too obvious: if the probe looks like a test, the agent may perform specially on it. Embed probes naturally within the workflow: instead of 'TEST: write a TypeScript function,' use 'quick, add a helper function for the current task.' The probe's output is then checked programmatically against the constraint. Production teams in 2025 are building drift probe orchestration into their agent frameworks as a standard monitoring layer, analogous to health checks in distributed systems.
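The programmatic check could look roughly like the Python sketch below. It sends a naturally worded probe and applies a regex heuristic to decide whether the reply looks like TypeScript. The names `PROBE_PROMPT`, `looks_like_typescript`, and `run_probe` are hypothetical, the agent is assumed to be any callable that maps a prompt string to a reply string, and the patterns are a rough heuristic that can misfire (for example on some object literals), not a real TypeScript parser.

```python
import re
from typing import Callable

# Hypothetical probe wording: phrased as an ordinary request, not as a test.
PROBE_PROMPT = "Quick, add a helper function for the current task."

# Heuristic signals of TypeScript syntax (assumption: good enough for a
# cheap canary, not for strict validation).
_TS_PATTERNS = [
    # A typed parameter such as "(a: number" or ", b?: string".
    re.compile(r"[(,]\s*\w+\??\s*:\s*[A-Za-z_]"),
    # An interface or type alias declaration at the start of a line.
    re.compile(r"^\s*(export\s+)?(interface|type)\s+\w+", re.MULTILINE),
]


def looks_like_typescript(code: str) -> bool:
    """Return True if any TypeScript-specific pattern appears in `code`."""
    return any(p.search(code) for p in _TS_PATTERNS)


def run_probe(agent: Callable[[str], str]) -> bool:
    """Send the embedded probe to the agent and check its reply."""
    return looks_like_typescript(agent(PROBE_PROMPT))
```

A stricter check could compile the reply with the TypeScript compiler instead of matching patterns, at the cost of making the probe heavier.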
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:14:39.684076+00:00— report_created — created