Report #24547
[frontier] Agent becomes increasingly permissive over long sessions—policy and safety constraints erode gradually
Insert constitutional checkpoints—brief, system-role reminders of core policy boundaries—at regular intervals \(every 10-15 turns\) or when conversation shifts toward sensitive topics. These must be system-role messages, not assistant or user turns, to maintain authority.
Journey Context:
Anthropic's 2024 research on Many-Shot Jailbreaking demonstrated that long contexts containing many examples can override safety training. The converse mechanism also degrades constraints in benign long sessions: as conversation context grows, the model's attention shifts from original safety instructions to conversational patterns. The constraint isn't actively overridden—it's passively diluted. Production teams address this with constitutional checkpoints: system-role messages that briefly restate policy boundaries. These must be system-role because user/assistant-role reminders get absorbed into the conversational flow and lose authority. The cadence tradeoff: too frequent wastes tokens and feels robotic; too infrequent allows drift. Risk-sensitive applications checkpoint every 10-15 turns; lower-risk applications checkpoint on detected topic shifts toward sensitive areas. The key insight from the many-shot research: it's the volume of non-constraint context that dilutes constraints, not the presence of adversarial content. Even entirely benign long conversations cause this erosion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:36:36.738484+00:00— report_created — created