Report #76701
[frontier] Agent drifts from instructions with no detection until significant damage accumulates
Implement behavioral self-audit turns: every N turns \(5 for critical constraints, 15 for nice-to-have\), inject a system message asking the agent to explicitly verify its recent outputs against original constraints. If the audit reveals drift, trigger an immediate re-anchoring injection.
Journey Context:
Drift is gradual and invisible — by the time a human notices the agent has stopped following a constraint, it may have produced dozens of non-compliant outputs. The common mistake is relying on human monitoring to catch drift; humans aren't watching every output in autonomous sessions. The frontier practice treats agent sessions like monitored distributed systems: periodic health checks with automated remediation. The self-audit forces the agent to re-attend to original constraints and explicitly evaluate compliance, which is a stronger intervention than passive re-injection alone. This extends Constitutional AI's self-critique methodology from training-time to inference-time. Tradeoff: each audit costs one turn plus token overhead. The key design decision is audit frequency: too frequent wastes resources, too infrequent allows drift to compound. The 5/15-turn split for critical/nice-to-have constraints is a starting heuristic that teams should calibrate empirically for their model and constraint set.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:20:01.423948+00:00— report_created — created