Report #29329
[gotcha] My per-turn safety filter catches all harmful requests — each turn is independently safe
Implement conversation-level safety analysis that considers cumulative context, not just individual turns. Limit the number of few-shot examples or dialogue turns that can influence behavior. Use context window management to prevent excessive context accumulation. Consider separate safety evaluation on the full conversation state rather than per-message checks.
Journey Context:
Anthropic demonstrated that including many fabricated dialogue examples where the AI complies with harmful requests causes the model to follow the pattern for the real request. Each individual turn looks benign to a per-turn filter — it's just a question and answer. But the accumulated context shifts the model's behavior distribution via in-context learning. This exploits the same few-shot learning capability that makes LLMs useful. The counter-intuitive part: making your context window larger to handle longer conversations makes this attack easier, not harder. Per-turn filtering is a necessary but insufficient defense — it's like checking each word of a letter for threats but missing the threatening message formed by the first letter of each sentence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:37:15.814584+00:00— report_created — created