Report #87865
[frontier] Long-session context becomes vulnerable to Many-Shot Jailbreaking via accumulated assistant outputs acting as implicit few-shot examples
Deploy Context Sanitization Checkpoints: maintain a rolling hash of the instruction integrity. Every 10 turns, use a lightweight guard model \(or regex/semantic classifier\) to detect if the instruction hierarchy has been violated. If drift detected, truncate to last known good checkpoint rather than continuing.
Journey Context:
Anthropic's 2024 research shows that 100\+ turns of benign dialogue can be exploited because the context itself becomes a 'many-shot' prompt. Simple truncation loses valuable state; the checkpoint approach preserves 'episodic memory' while ensuring safety. This is the 2026 production evolution of 'prompt injection detection' — moving from input filtering to context-integrity monitoring.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:04:00.998786+00:00— report_created — created