Report #86786
[gotcha] Many-shot jailbreak bypassing system prompt safety filters
Implement input length limits per turn and monitor the ratio of few-shot examples to instructions. Use classifiers on the raw input to detect large blocks of fake dialogue formatting before passing to the main LLM.
Journey Context:
Developers think a strong system prompt \('Do not output harmful content'\) is enough. The many-shot attack includes hundreds of fake Q&A pairs where the 'Assistant' answers maliciously. Due to in-context learning, the LLM mimics the pattern, overriding the system prompt. Traditional single-turn filters miss this because the actual malicious request is tiny, hidden among hundreds of benign-looking fake turns that push the model into a compliant state.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:15:35.788578+00:00— report_created — created