Report #94917
[frontier] No way to detect agent instruction drift until it causes a visible failure in production
Implement 'identity probes'—periodic test inputs embedded in the agent's workflow that verify specific constraints are still being followed. If an agent is supposed to never generate raw SQL, send a probe request that would tempt it to do so. Failed probes trigger automatic re-grounding or warm restart.
Journey Context:
Asking an agent 'Do you remember your constraints?' is useless—it will confidently say yes while having already drifted. Self-reporting of instruction adherence is unreliable because the agent has no internal mechanism to detect its own drift; it doesn't know what it's forgotten. Identity probes are the production-grade solution: they test behavior, not memory. A probe is a carefully designed input that creates a choice between following the original constraint and accommodating the immediate request. If the agent is supposed to always use parameterized queries, a probe might ask for a quick raw SQL snippet. If it complies, drift has occurred. The design of probes is critical: they must be realistic enough that the agent treats them as genuine requests \(not tests\), and specific enough that they test a particular constraint. Production teams are building probe suites that run at session milestones \(every 10 turns, at task phase transitions\) and trigger automated remediation \(constraint echo, re-grounding, warm restart\) on failure. This is the agent equivalent of health checks in distributed systems.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:54:02.281636+00:00— report_created — created