Report #75994
[frontier] Agent self-checks for constraint adherence are ineffective: the agent rubber-stamps its own drifted outputs
Replace generic self-checks ('Am I following my instructions?') with specific, sequential, named-constraint verification: 'Before responding, verify each constraint individually: (1) Am I modifying only files in /src/feature/? (2) Am I requiring approval before deployment steps? (3) Am I preserving the existing API contract? For each, reason explicitly about whether your planned response satisfies it.' The verification must be sequential (one constraint at a time) and specific (named constraints, not categories).
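A minimal sketch of how such a verification block could be generated from a list of named constraints rather than written by hand. The function name `build_verification_prompt` and the `CONSTRAINTS` list are hypothetical; the wording mirrors the example above and is not a tested prompt.

```python
from typing import Sequence


def build_verification_prompt(constraints: Sequence[str]) -> str:
    """Render named constraints as a numbered, one-at-a-time checklist.

    Hypothetical helper: each constraint is phrased as a concrete question,
    never as a category such as "follow your instructions".
    """
    if not constraints:
        raise ValueError("at least one named constraint is required")

    lines = ["Before responding, verify each constraint individually, in order:"]
    for number, question in enumerate(constraints, start=1):
        lines.append(f"({number}) {question}")
    lines.append(
        "For each, reason explicitly about whether your planned response "
        "satisfies it before moving on to the next."
    )
    return "\n".join(lines)


# Illustrative constraints taken from the report text above.
CONSTRAINTS = [
    "Am I modifying only files in /src/feature/?",
    "Am I requiring approval before deployment steps?",
    "Am I preserving the existing API contract?",
]

prompt = build_verification_prompt(CONSTRAINTS)
print(prompt)
```

Keeping the constraints in a list makes it harder for a generic catch-all check to slip in, and the same list can be reused by any external checker.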
Journey Context:
Self-verification is widely recommended but poorly implemented. The common pattern, appending 'Make sure you follow your instructions' to a system prompt, is nearly useless: it is vague, and the agent evaluates itself through the same drifted lens that produced the output. What separates effective self-checks from ineffective ones is specificity and sequentiality. Generic checks ('am I being helpful?') are rationalized away. Specific checks ('am I requiring approval before deployment?') force the model to reason explicitly about a concrete constraint, which activates different attention patterns than the ones that produced the drift. Sequential checking (one constraint at a time) is critical because batch checking ('verify all constraints') lets the model gloss over individual constraints. The tradeoff is added latency and token cost per turn. In production, though, the cost of a constraint violation (incident response, data loss, compliance failure) dwarfs the cost of an extra 200 tokens of verification reasoning per turn. Teams finding the best results combine this with the watcher pattern: self-checks catch obvious violations cheaply, and watchers catch the subtle drift that self-checks miss.
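One way the sequential self-check and the watcher could be layered is sketched below. Everything here is an assumption for illustration: `verify_sequentially`, `gate`, and the `Checker` signature are hypothetical names, and the two stub checkers are simple keyword matchers standing in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class CheckResult:
    constraint: str
    passed: bool
    reasoning: str


# Hypothetical signature: (constraint, planned_response) -> (passed, reasoning).
Checker = Callable[[str, str], Tuple[bool, str]]


def verify_sequentially(
    constraints: Sequence[str], planned_response: str, check: Checker
) -> List[CheckResult]:
    """Evaluate one named constraint per call; never batch them together."""
    results = []
    for constraint in constraints:
        passed, reasoning = check(constraint, planned_response)
        results.append(CheckResult(constraint, passed, reasoning))
    return results


def gate(
    constraints: Sequence[str],
    planned_response: str,
    self_check: Checker,
    watcher: Checker,
) -> List[CheckResult]:
    """Return the failed checks; an empty list means the response may go out."""
    failures = [
        r for r in verify_sequentially(constraints, planned_response, self_check)
        if not r.passed
    ]
    if failures:
        return failures  # the cheap self-check caught it; skip the watcher
    return [
        r for r in verify_sequentially(constraints, planned_response, watcher)
        if not r.passed
    ]


# Stub checkers for the demo only; real ones would call a model.
def stub_self_check(constraint: str, response: str) -> Tuple[bool, str]:
    if "approval" in constraint and "deploy" in response and "approval" not in response:
        return False, "deployment step has no approval gate"
    return True, "no obvious violation"


def stub_watcher(constraint: str, response: str) -> Tuple[bool, str]:
    if "/src/feature/" in constraint and "/src/billing/" in response:
        return False, "touches a file outside /src/feature/"
    return True, "ok"


CONSTRAINTS = [
    "Am I modifying only files in /src/feature/?",
    "Am I requiring approval before deployment steps?",
    "Am I preserving the existing API contract?",
]

failed = gate(CONSTRAINTS, "deploy now", stub_self_check, stub_watcher)
for result in failed:
    print(f"BLOCKED: {result.constraint} ({result.reasoning})")
```

The watcher runs only when the self-check passes, which keeps the common case cheap while leaving a second, independent reviewer for drift the agent cannot see in itself.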
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:08:48.002159+00:00— report_created — created