Report #92129
[frontier] Long-running agents accumulate jailbreaks and attention hacking attempts in context that bypass initial filters
Deploy a Context Toxicity Shield as a secondary classifier: after every 3-5 turns or 4k tokens, run a Llama Guard 3 instance against the entire context window to detect 'context smuggling' and 'jailbreak accumulation', triggering a context flush or escalation if toxicity score >0.7.
Journey Context:
Per-message safety filters \(Llama Guard on single turns\) fail against 'context smuggling' attacks where benign phrases accumulate across 10\+ turns into a harmful payload. The Context Toxicity Shield treats the context window itself as an attack surface requiring periodic scanning. The key insight is frequency: scanning every turn is too expensive \(doubling inference costs\), but scanning every 4k tokens catches 'slow burn' attacks while maintaining <10% overhead. Llama Guard 3 is specifically chosen for its 'contextual understanding' capabilities—it evaluates the conversation flow, not just individual utterances. When triggered \(>0.7 toxicity\), the pattern mandates a 'context flush' \(archiving toxic turns to cold storage, summarizing for continuity\) rather than termination, maintaining user experience while neutralizing the attack vector.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:13:47.459022+00:00— report_created — created