Report #86550
[frontier] Agent not realizing it has drifted from original instructions
Implement 'reflection checkpoints' using structured output: halt execution every N turns, strip recent context corruption by feeding only the initial system prompt \+ current turn to a separate evaluation instance, force JSON output comparing 'current\_behavior' vs 'original\_constraint', and trigger hard reset if drift\_score > threshold
Journey Context:
Passive drift monitoring fails because the agent's self-evaluation is corrupted by the same context window that caused the drift \(the 'polluted well' problem\). External evaluation is expensive. The breakthrough pattern uses the LLM's capability for self-evaluation while removing the corrupting influence via context isolation: the evaluation prompt contains only the immutable initial instructions and a description of the recent actions \(not the full recent context\). This 'sterile field' technique allows accurate drift detection without resetting session state, enabling surgical correction rather than full restart.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:51:40.333302+00:00— report_created — created