Report #81569
[gotcha] Bypassing safety alignment using many-shot in-context examples
Limit the maximum context length and the number of conversational turns a user can supply in a single prompt. For conversation history, use a rolling window or summarization instead of blindly appending every user turn.
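A minimal sketch of the rolling-window idea, assuming history is stored as a list of role/content dictionaries. The names `trim_history`, `MAX_TURNS`, and `MAX_TOTAL_CHARS` are hypothetical, and the character budget stands in for a real token count:

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}

MAX_TURNS = 20          # hypothetical cap on retained non-system messages
MAX_TOTAL_CHARS = 8000  # hypothetical budget; a real system would count tokens


def trim_history(history: List[Message],
                 max_turns: int = MAX_TURNS,
                 max_chars: int = MAX_TOTAL_CHARS) -> List[Message]:
    """Keep system messages plus the newest dialogue messages that fit the caps."""
    system = [m for m in history if m["role"] == "system"]
    dialogue = [m for m in history if m["role"] != "system"]

    # Guard against max_turns <= 0, where a negative slice would keep everything.
    recent = dialogue[-max_turns:] if max_turns > 0 else []

    kept: List[Message] = []
    used = 0
    # Walk backwards so the newest messages claim the budget first.
    for msg in reversed(recent):
        size = len(msg["content"])
        if used + size > max_chars:
            break
        kept.append(msg)
        used += size
    kept.reverse()  # restore chronological order
    return system + kept
```

Calling `trim_history` before every model request bounds how much user-supplied dialogue reaches the model, whatever the client sends. It does not replace other safety checks.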
Journey Context:
LLMs are trained to continue patterns. If an attacker prepends hundreds of fake dialogue turns in which the 'User' asks harmful questions and the 'Assistant' answers them, in-context learning can override the model's RLHF safety training. Developers often assume safety training is robust, but a many-shot attack shifts the context so heavily toward compliance that the model follows the pattern of the examples rather than its aligned behavior.
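One way to enforce a per-prompt turn limit is to count role-like markers embedded in a single user message. The following is a heuristic sketch only: the label set, the threshold, and the function names are assumptions, and an attacker can evade it by using different labels or formatting.

```python
import re

# Hypothetical role labels; real attacks may use other labels or formats.
ROLE_MARKER = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)

MAX_EMBEDDED_TURNS = 4  # hypothetical threshold for one user message


def count_embedded_turns(text: str) -> int:
    """Count lines that start with a role-like label such as 'User:'."""
    return len(ROLE_MARKER.findall(text))


def looks_like_many_shot(text: str, limit: int = MAX_EMBEDDED_TURNS) -> bool:
    """Flag a single message that embeds more fake turns than the limit."""
    return count_embedded_turns(text) > limit
```

A flagged message could be rejected, truncated, or routed to stricter handling. Which of those is appropriate depends on the application.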
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:30:58.156887+00:00— report_created — created