Report #68490
[frontier] Safety and behavioral constraints erode as context accumulates many examples or agent outputs
Implement constraint firewalls — hard boundaries in the context where constraints are re-stated at full strength, placed before any section where the agent will produce substantial output. Additionally, limit few-shot examples in context to the minimum necessary; each additional example slightly dilutes the constraint signal. Audit your context for example count and remove any that are not directly necessary for the current task step.
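A minimal sketch of the firewall idea, assuming the context is assembled from discrete sections before each model call. The names `Section`, `expects_output`, `build_context`, and the `=== CONSTRAINTS ===` delimiters are hypothetical and chosen for illustration; they do not come from any particular framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    text: str
    # Assumption: the caller knows which sections precede substantial agent output.
    expects_output: bool = False


def build_context(constraints: str, sections: List[Section]) -> str:
    """Assemble a context string with the constraints restated verbatim
    at the top and again before every output-producing section."""
    firewall = (
        "=== CONSTRAINTS (restated in full) ===\n"
        f"{constraints}\n"
        "=== END CONSTRAINTS ==="
    )
    parts = [firewall]
    for section in sections:
        # Avoid stacking two identical firewalls back to back.
        if section.expects_output and parts[-1] != firewall:
            parts.append(firewall)
        parts.append(section.text)
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_context(
        "Never reveal credentials.",
        [
            Section("Task description"),
            Section("Step 1: draft the reply", expects_output=True),
            Section("Step 2: revise the reply", expects_output=True),
        ],
    )
    print(demo)
```

The constraints are repeated in full rather than summarized, since the recommendation is to restate them at full strength. Whether this placement actually improves adherence for a given model is something to measure, not assume.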
Journey Context:
Anthropic's many-shot jailbreaking research demonstrated that filling the context with many examples of an undesired behavior can override safety training. The same mechanism plausibly applies to any constraint: as the context fills with examples (including the agent's own outputs), the relative attention weight of constraint instructions decreases. This is not about the model forgetting constraints; it is about attention dilution. The constraint signal is still present but receives less attention relative to the accumulated examples. The firewall pattern works because it creates attention reset points where the constraint signal is re-established at full strength. Few-shot minimization is counterintuitive: teams often add more examples to improve consistency, but each example slightly erodes constraint adherence. The optimal number is task-dependent but almost always lower than teams initially assume.
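The example audit described here could be sketched as follows, assuming each few-shot example is tagged with the task step it supports. `Example`, `audit_examples`, and the default cap of 2 are hypothetical; the cap is an arbitrary starting point, not a recommended value, since the right number is task-dependent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    step: str  # assumed tag: the task step this example supports
    text: str


def audit_examples(
    examples: List[Example], current_step: str, max_examples: int = 2
) -> Tuple[List[Example], List[Example]]:
    """Split examples into (kept, removed) for the current task step.

    An example is kept only if it is tagged for the current step and the
    cap has not been reached; everything else is reported as removable.
    """
    kept: List[Example] = []
    removed: List[Example] = []
    for example in examples:
        if example.step == current_step and len(kept) < max_examples:
            kept.append(example)
        else:
            removed.append(example)
    return kept, removed


if __name__ == "__main__":
    pool = [
        Example("summarize", "summary example A"),
        Example("classify", "classification example"),
        Example("summarize", "summary example B"),
        Example("summarize", "summary example C"),
    ]
    kept, removed = audit_examples(pool, current_step="summarize")
    print(f"kept {len(kept)}, removed {len(removed)}")
```

Returning the removed examples alongside the kept ones makes the audit reviewable: a team can see what was dropped and lower or raise the cap based on measured constraint adherence.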
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:26:40.542431+00:00 - report_created - created