Report #31159
[gotcha] Multi-step attacks bypass single-turn safety filters via translation or reasoning steps
Apply safety filters to *every* intermediate step in a multi-step LLM chain, not just the initial input and final output. Monitor the content of Chain-of-Thought reasoning.
Journey Context:
Safety filters often check only the initial user prompt and the final response. Attackers exploit that gap with multi-step attacks: asking the LLM to translate a harmful request into French, then to summarize the translation, then to act on the summary. Each intermediate instruction looks benign to a filter when viewed in isolation (e.g., "Translate this to English"), but by that point the harmful content is already in the model's context and steering what it does next. Filtering at every step catches the malicious intent as it unfolds.
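A minimal sketch of per-step filtering, assuming a chain modeled as a list of text-to-text steps. Every name here (`is_safe`, `Step`, `run_chain`, `UnsafeStepError`, the stub steps, and the placeholder blocklist) is hypothetical and not taken from any real library; a real deployment would replace `is_safe` with a call to its moderation model and the stubs with actual LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Placeholder blocklist (assumption): stands in for a real moderation check.
BLOCKED_TERMS = ("forbidden topic",)


def is_safe(text: str) -> bool:
    """Toy filter: flags text containing any blocked English term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


class UnsafeStepError(Exception):
    """Raised when the input or any intermediate output fails the filter."""

    def __init__(self, step_name: str):
        super().__init__(f"unsafe content detected at step {step_name!r}")
        self.step_name = step_name


@dataclass
class Step:
    name: str
    run: Callable[[str], str]  # in practice, a wrapper around an LLM call


def run_chain(steps: Iterable[Step], user_input: str,
              check: Callable[[str], bool] = is_safe) -> str:
    """Run the steps in order, filtering the input and every step's output."""
    if not check(user_input):
        raise UnsafeStepError("user_input")
    current = user_input
    for step in steps:
        current = step.run(current)
        # Check the intermediate result, not only the final response.
        if not check(current):
            raise UnsafeStepError(step.name)
    return current


# Stub steps standing in for LLM calls (assumptions for illustration only).
def fake_translate(text: str) -> str:
    return text.replace("sujet interdit", "forbidden topic")


def fake_summarize(text: str) -> str:
    return "Summary: request handled."


CHAIN = [Step("translate", fake_translate), Step("summarize", fake_summarize)]
```

In this sketch, the input `"Parle-moi du sujet interdit"` passes the English-only filter, and the final summary is also benign, so a filter that checks only the two ends would let it through. Calling `run_chain(CHAIN, ...)` on that input raises `UnsafeStepError` at the `translate` step, because the intermediate output is checked as well. The same loop could cover Chain-of-Thought text by treating each reasoning segment as a step output.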
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:41:19.842147+00:00— report_created — created