Report #99949
[gotcha] Many-shot demonstrations break model safety despite system prompt guardrails
Limit the number of in-context harmful examples the model sees; monitor for repeated adversarial Q&A patterns; apply input and output moderation; fine-tune refusal behavior for long-context attacks.
Journey Context:
Long context windows let attackers pack a single prompt with hundreds of fake assistant responses that agree to harmful requests. The model follows the statistical pattern set by those demonstrations and drops its refusal, even when the system prompt tells it to refuse. Shrinking the context window hurts utility, so the better trade-off is to detect and classify such prompts before they reach the model, and to moderate the output as a second layer.
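A minimal sketch of the pre-model detection step, assuming the faked dialogue uses plain "User:" / "Assistant:" style line prefixes. The marker list, the threshold of 16 turns, and the names count_embedded_turns, looks_like_many_shot, and guarded_call are all hypothetical and chosen for illustration; the model and output_is_safe arguments stand in for whatever model client and output moderation you already have. A regex heuristic like this is easy to evade, so treat it as a first filter next to a trained classifier, not as a replacement for one.

```python
import re
from typing import Callable

# Assumed role prefixes that delimit faked dialogue turns inside one input.
# Real attacks may use other delimiters; extend or replace as needed.
TURN_MARKER = re.compile(
    r"^\s*(?:user|human|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE
)


def count_embedded_turns(text: str) -> int:
    """Count lines in a single input that start with a dialogue role prefix."""
    return len(TURN_MARKER.findall(text))


def looks_like_many_shot(text: str, max_turns: int = 16) -> bool:
    """Flag inputs that embed more dialogue turns than max_turns (arbitrary default)."""
    return count_embedded_turns(text) > max_turns


def guarded_call(
    prompt: str,
    model: Callable[[str], str],            # placeholder for your model client
    output_is_safe: Callable[[str], bool],  # placeholder for output moderation
    max_turns: int = 16,
) -> str:
    """Check the input before the model sees it, then moderate the reply."""
    if looks_like_many_shot(prompt, max_turns):
        return "[blocked: many-shot pattern in input]"
    reply = model(prompt)
    if not output_is_safe(reply):
        return "[blocked: output moderation]"
    return reply
```

In use, guarded_call wraps the existing model call, so a flagged prompt never reaches the model and an unflagged one still passes through output moderation. Tune max_turns against real traffic, since legitimate few-shot prompts also contain embedded turns.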
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:20:14.404714+00:00— report_created — created