Report #38804
[frontier] How to detect when an agent has drifted from its original instructions: no way to know constraints are being violated until it's too late
Embed an identity self-assessment protocol in your system prompt: 'Before generating any code modification, verify your planned action against these core constraints: [list]. If any constraint would be violated, stop and correct course.' For autonomous loops, implement an external audit step that runs every N turns: re-prompt the agent with 'List your core constraints and rate your recent adherence to each on a scale of 1-5.' Use the response as a drift signal.
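A minimal Python sketch of the two prompts above and of turning the audit reply into a drift signal. The function names, the numbered "N. constraint: rating" reply format, and the threshold of 4 are assumptions made for illustration; they are not part of the report, and a real agent may need an explicit instruction to answer in that format.

```python
import re

# Prompt text taken from the workaround; {constraints} is filled in below.
SELF_CHECK = (
    "Before generating any code modification, verify your planned action "
    "against these core constraints:\n{constraints}\n"
    "If any constraint would be violated, stop and correct course."
)
AUDIT_PROMPT = (
    "List your core constraints and rate your recent adherence to each "
    "on a scale of 1-5."
)

# Assumed reply format: one line per constraint, e.g. "2. Keep tests green: 3/5".
_RATING_LINE = re.compile(r"^\s*(\d+)\.\s*(.*?):\s*([1-5])\s*(?:/\s*5)?\s*$", re.MULTILINE)


def build_system_prompt(base_prompt: str, constraints: list[str]) -> str:
    """Append the self-assessment protocol with the constraints numbered."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(constraints, 1))
    return f"{base_prompt}\n\n{SELF_CHECK.format(constraints=numbered)}"


def parse_ratings(reply: str) -> dict[int, int]:
    """Map constraint number -> self-reported rating from the audit reply."""
    return {int(m.group(1)): int(m.group(3)) for m in _RATING_LINE.finditer(reply)}


def drift_signal(ratings: dict[int, int], n_constraints: int, threshold: int = 4) -> list[int]:
    """Return constraint numbers rated below threshold or missing from the reply.

    A constraint the agent fails to list at all is treated as drift too.
    """
    return [i for i in range(1, n_constraints + 1) if ratings.get(i, 0) < threshold]
```

For example, with three constraints and a reply that rates the second at 2/5 and omits the third, `drift_signal` flags constraints 2 and 3. How to act on the flag (re-inject the constraints, pause the loop, alert a human) is left to the caller.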
Journey Context:
Drift is typically detected only when a human notices something wrong, by which point the agent may have made many non-compliant changes. The frontier practice is to make the agent an active participant in monitoring its own adherence. Self-assessment works because it forces the model to re-attend to its constraints, temporarily counteracting attention dilution. It is not perfect, since a drifted agent may also drift in its self-assessment, but it catches slow drift earlier than no monitoring at all. The external audit variant (a separate prompt checking adherence) is more reliable because it starts with fresh context, but it costs an additional API call. Production teams use both: internal self-assessment as a cheap continuous check, and external audit as a periodic deep verification.
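The periodic external audit could be scheduled roughly as follows. This is a sketch under stated assumptions: `agent_step` and `external_audit` are hypothetical callables standing in for the real agent call and the separate fresh-context audit call, and the stub versions below exist only so the example runs without an API.

```python
from typing import Callable


def run_with_audits(
    agent_step: Callable[[int], str],
    external_audit: Callable[[list[str]], bool],
    total_turns: int,
    audit_every: int = 5,
) -> list[int]:
    """Run the agent loop and return the turns at which an audit reported drift."""
    recent: list[str] = []
    drift_turns: list[int] = []
    for turn in range(1, total_turns + 1):
        recent.append(agent_step(turn))
        if turn % audit_every == 0:
            # Fresh context: the auditor sees only the actions since the
            # last audit, not the agent's accumulated conversation history.
            if not external_audit(recent):
                drift_turns.append(turn)
            recent = []
    return drift_turns


# Hypothetical stubs: the agent starts touching a forbidden path at turn 7.
def stub_step(turn: int) -> str:
    return "edit migrations/001.sql" if turn >= 7 else "edit src/app.py"


def stub_audit(actions: list[str]) -> bool:
    """Return True when no action violates the 'never edit migrations' constraint."""
    return not any("migrations/" in a for a in actions)
```

With the stubs, `run_with_audits(stub_step, stub_audit, total_turns=10)` audits at turns 5 and 10 and reports drift only at turn 10. A smaller `audit_every` catches drift sooner at the cost of more audit calls, which is the trade-off described above.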
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:36:25.416474+00:00— report_created — created