Report #80234
[gotcha] Safety guardrails failing on long contexts with many adversarial examples
Enforce input length limits and monitor the ratio of adversarial-looking text to benign text. Evaluate the input with streaming or chunk-based classifiers instead of relying on the LLM's system prompt to maintain safety across a very long context.
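A minimal sketch of the chunk-based screening described above, in Python. Everything named here is an assumption for illustration and not part of the original report: the thresholds (MAX_INPUT_CHARS, CHUNK_CHARS, MAX_FLAGGED_RATIO), the Verdict type, and especially toy_classifier, which is a keyword placeholder standing in for whatever real safety classifier you would run per chunk.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed limits for illustration; tune them for your model and traffic.
MAX_INPUT_CHARS = 20_000
CHUNK_CHARS = 500
MAX_FLAGGED_RATIO = 0.2


@dataclass
class Verdict:
    allowed: bool
    reason: str
    flagged_ratio: float


def toy_classifier(chunk: str) -> bool:
    """Hypothetical stand-in for a real safety classifier.

    Flags a chunk if it contains any of a few hard-coded phrases.
    """
    markers = ("ignore previous instructions", "sure, here is how to")
    lowered = chunk.lower()
    return any(marker in lowered for marker in markers)


def split_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive, non-overlapping chunks of at most `size` chars."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def screen_input(
    text: str,
    classifier: Callable[[str], bool] = toy_classifier,
    max_chars: int = MAX_INPUT_CHARS,
    chunk_chars: int = CHUNK_CHARS,
    max_ratio: float = MAX_FLAGGED_RATIO,
) -> Verdict:
    """Screen input before it reaches the LLM, independent of the system prompt."""
    # 1. Hard length limit: reject oversized inputs outright.
    if len(text) > max_chars:
        return Verdict(False, "input_too_long", 0.0)

    chunks = split_chunks(text, chunk_chars)
    if not chunks:
        return Verdict(True, "empty", 0.0)

    # 2. Classify each chunk on its own, so early content is judged
    #    as strictly as late content.
    flagged = sum(1 for chunk in chunks if classifier(chunk))
    ratio = flagged / len(chunks)

    # 3. Block when the share of flagged chunks exceeds the threshold.
    if ratio > max_ratio:
        return Verdict(False, "too_many_flagged_chunks", ratio)
    return Verdict(True, "ok", ratio)
```

Non-overlapping chunks can miss a phrase that straddles a chunk boundary; overlapping windows would reduce that risk at the cost of more classifier calls. Running the check before the model call is the point of the design: the verdict does not depend on the model honoring its system prompt.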
Journey Context:
LLMs exhibit recency bias and learn from examples in context. A long context filled with many examples of bad behavior (a many-shot jailbreak) shifts the model's output distribution toward complying with that behavior, so the sheer statistical weight of the examples overrides the system prompt's safety instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:16:43.436497+00:00— report_created — created