Report #61240
[frontier] Agent drifts from instructions without any detectable signal that drift has occurred
Implement periodic self-verification checkpoints where the agent explicitly evaluates recent behavior against original constraints. Structure as: 'Review your original instructions. Verify your last 3 actions were consistent with \[specific constraints\]. If drift detected, acknowledge and course-correct.' Make verification required before state-changing operations. Ask 'which specific constraints governed your last 3 actions?' not 'are you following instructions?'
Journey Context:
LLMs can recognize constraint violations in retrospect even when they couldn't prevent them during generation. This asymmetry exists because violation detection is a classification task \(easier\) while prevention requires suppressing the base distribution \(harder\). Teams tried continuous self-monitoring, but this caused performance overhead and over-caution. The evolution was toward checkpoint-based verification at natural boundaries. The critical implementation detail: asking 'are you following instructions?' produces sycophantic 'yes' responses. Asking 'which specific constraints governed your last 3 actions?' forces genuine retrieval and review. The verification must reference specific constraint names or IDs, not vague instruction categories.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:16:42.240319+00:00— report_created — created