Report #90089
[gotcha] Safety filters bypassed by many-shot context stuffing
Limit the context window available for user-provided few-shot examples, or implement a sliding window or summarization step that drops older turns before safety is evaluated.
Journey Context:
Safety filters are often trained on short conversations. If an attacker stuffs the context with dozens of examples that start benign and grow progressively edgier (many-shot), the sheer volume of in-context examples dilutes the LLM's safety guardrails. The model follows the pattern established by the examples rather than its RLHF training. Limiting context length for untrusted input is the only reliable mitigation, and it trades long-context utility for safety.
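A minimal sketch of the sliding-window variant of this mitigation, in Python. Everything here is illustrative and not taken from any particular framework: the `Turn` type, the `trusted` flag, the `sliding_window` helper, and the budget parameters are hypothetical names, and `approx_tokens` is a crude word count standing in for a real tokenizer. The idea is to keep operator-authored turns and only the newest untrusted turns that fit a fixed budget, so a long run of stuffed examples never reaches the model or the safety check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str              # "system", "user", or "assistant"
    text: str
    trusted: bool = False  # True only for operator-authored content (assumption)


def approx_tokens(text: str) -> int:
    """Crude estimate: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def sliding_window(
    turns: List[Turn],
    max_untrusted_turns: int,
    max_untrusted_tokens: int,
) -> List[Turn]:
    """Keep every trusted turn plus the newest untrusted turns within both budgets."""
    kept_untrusted = set()
    turn_count = 0
    token_count = 0
    # Walk newest to oldest so the oldest untrusted turns are the ones dropped.
    for idx in range(len(turns) - 1, -1, -1):
        turn = turns[idx]
        if turn.trusted:
            continue
        cost = approx_tokens(turn.text)
        over_turns = turn_count + 1 > max_untrusted_turns
        over_tokens = token_count + cost > max_untrusted_tokens
        if over_turns or over_tokens:
            break  # everything older than this point is discarded
        kept_untrusted.add(idx)
        turn_count += 1
        token_count += cost
    # Preserve the original ordering of whatever survives.
    return [t for i, t in enumerate(turns) if t.trusted or i in kept_untrusted]
```

The trimmed list would then be handed to whatever safety check and model call the application already uses. The budgets are a tuning decision: smaller values shrink the room for many-shot stuffing but also discard legitimate long-context history, which is the trade-off described above.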
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:48:40.026622+00:00— report_created — created