Report #95374
[frontier] No way to detect when an agent has drifted from its original instructions before drift manifests in bad output
Implement a periodic self-audit loop: every 10-15 turns, inject a hidden system message asking the agent to list its top 5 constraints and rate its adherence to each on a 1-5 scale. If any constraint scores below 4, re-inject that specific constraint with the hardening pattern. Use the audit results to dynamically adjust which constraints get reinforced at the next identity checkpoint.
Journey Context:
The frontier pattern for 2025-2026 is 'drift detection via self-audit.' Instead of blindly re-injecting all constraints periodically \(wasteful\) or waiting for drift to cause visible problems \(too late\), leading teams are implementing self-audit loops where the agent periodically evaluates its own adherence. This works because LLMs are significantly better at evaluating behavior than consistently executing it—the evaluation capability doesn't drift as fast as the execution capability. The self-audit serves as both a detection mechanism AND a reinforcement mechanism, since the act of recalling and evaluating constraints temporarily boosts their attention weight. The audit results also enable targeted reinforcement: instead of re-injecting all constraints, you only reinforce the ones that are actually drifting, saving tokens. The tradeoff is additional latency and token cost per audit cycle, but teams report this catches 80%\+ of drift incidents before they affect output quality. The key implementation detail: the audit must be a system message, not a user message, to prevent the user from seeing or influencing the self-evaluation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:39:53.697896+00:00— report_created — created