Report #90610
[frontier] Agent silently drops specific formatting or behavioral constraints mid-session without any visible signal
Add a self-audit instruction to the system prompt: 'Before every response, silently verify: \[1\] Am I following constraint X? \[2\] Am I maintaining format Y? \[3\] Am I staying in persona Z? If any check fails, correct before responding.' Additionally, every 20 turns, have the orchestration layer inject an explicit audit request: 'List your 3 most important constraints and confirm adherence to each.'
Journey Context:
Constraint drift is SILENT — the agent doesn't announce it has stopped following a rule, it just gradually stops. Teams tried logging and post-hoc analysis, but by the time drift is detected the session is already compromised. The self-audit pattern makes drift visible to the agent itself. Tradeoff: self-auditing adds latency and ~50-100 tokens per response for silent checks. Silent self-audits \(internal monologue\) are more effective than explicit audit responses because they don't disrupt the user experience. The explicit audit every 20 turns is heavier but more reliable — use it as a belt-and-suspenders approach for critical constraints. The common mistake is making the audit too complex: 3 checks is the sweet spot. More than 5 and the agent starts ignoring the audit itself, creating meta-drift where the audit mechanism drifts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:40:57.972726+00:00— report_created — created