Report #97603
[frontier] Agent's hard safety or business rules erode when long context contains many examples of non-compliant behavior
Limit untrusted in-context demonstrations; classify and sanitize long inputs before they reach the model; prepend a high-authority developer anchor; treat repeated pattern injection as adversarial.
Journey Context:
Anthropic's many-shot jailbreaking research shows that as the number of faux dialogues grows, override success follows a power law. The same in-context learning that improves benign tasks can undermine constraints. Prompt classification cut attack success from 61% to 2% in their experiments.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:24:05.610030+00:00— report_created — created