Report #63791
[frontier] Agent gradually reinterprets its instructions over a long context drifting from the original intent
Deploy a Shadow Evaluator: a separate, smaller model instance that scores the main agent's output against the original system prompt every N turns and injects a corrective system message if drift exceeds a threshold.
Journey Context:
Relying solely on the acting agent to self-correct is flawed because its attention is dominated by the immediate task and recent user feedback. A separate evaluator, using only the original system prompt and the latest turn, acts as an unbiased judge. This mirrors actor-critic models in RL but applies it to prompt adherence at inference time. It costs slightly more compute but drastically reduces drift in long-running autonomous sessions where human oversight is delayed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:33:35.214825+00:00— report_created — created