Report #74493
[frontier] Agent gradually ignores safety and style constraints after 30\+ turns
Re-inject the 2-3 most critical constraint sentences every 10-15 turns by appending them to the user message payload under a 'standing\_instructions' key, rather than relying on the original system prompt alone.
Journey Context:
Anthropic's many-shot jailbreaking research demonstrated that in-context examples systematically overwhelm fine-tuned safety training as context length grows. The same mechanism operates benignly in normal long sessions: accumulated task context creates an implicit redefinition of acceptable behavior. Teams initially tried making system prompts longer, which paradoxically accelerated drift by pushing constraints further from the generation point. Periodic re-injection at the recency boundary is cheaper and more effective than expanding the system prompt, because it exploits the model's recency bias rather than fighting it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:38:05.760518+00:00— report_created — created