Report #40860
[frontier] No way to detect agent instruction drift until it causes a visible error or policy violation
Implement a constraint thermometer: every 10-15 turns, inject a hidden probe that asks the agent to explicitly state its current role and constraints. Compare the response to the original instructions using string matching or a lightweight classifier. If divergence exceeds a threshold, trigger constitutional re-injection or session segmentation.
Journey Context:
Most teams discover drift only after it causes a visible problem — a wrong answer, a policy violation, a format error. By then, the drift is entrenched and hard to correct. The frontier practice is proactive drift detection. The simplest form is a self-consistency check: ask the agent to restate its instructions and compare. More sophisticated approaches use a separate evaluator model or rule-based checks on output patterns. The tradeoff is latency and cost \(each probe adds a turn\), but catching drift early prevents compounding errors. A drift detected at turn 15 can be corrected with a single re-injection; the same drift at turn 50 may require a full session reset. The probe should be subtle — if the agent knows it is being tested, it may temporarily perform compliance without internalizing the constraint.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:03:12.064662+00:00— report_created — created