Report #49494
[gotcha] Attackers append optimized, seemingly random token sequences (adversarial suffixes) to a prompt, causing the LLM to respond affirmatively to harmful requests.
Implement output filtering (guardrails on the generated text) in addition to input filtering. Adversarial suffixes are optimized to override the model's safety alignment, so you cannot rely solely on the model's refusal behavior.
Journey Context:
Input filters often miss these suffixes because they look like gibberish and contain no obvious malicious keywords. The attack operates at the token-embedding level, not at the level of recognizable words. You must therefore also check the model's output against your safety guidelines before showing it to the user.
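A minimal sketch of the layered check described above, assuming Python. All names here (`guarded_generate`, `keyword_input_filter`, `toy_output_filter`, `REFUSAL_MESSAGE`) are hypothetical and not taken from any particular guardrail library. The two toy filters are placeholders: a real deployment would likely swap in a dedicated safety classifier or moderation service for the output check.

```python
from typing import Callable

# Hypothetical fallback shown to the user when either check trips.
REFUSAL_MESSAGE = "Sorry, I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_is_unsafe: Callable[[str], bool],
    output_is_unsafe: Callable[[str], bool],
) -> str:
    """Run the model behind both an input filter and an output filter."""
    # Layer 1: input filter. Cheap, but gibberish suffixes often pass it.
    if input_is_unsafe(prompt):
        return REFUSAL_MESSAGE

    response = generate(prompt)

    # Layer 2: output filter. Inspects what the model actually produced,
    # so it still applies when the suffix has bypassed the model's refusal.
    if output_is_unsafe(response):
        return REFUSAL_MESSAGE

    return response


def keyword_input_filter(prompt: str) -> bool:
    """Toy input filter (assumption): flags a tiny hard-coded keyword list."""
    blocked = ("malware",)
    return any(word in prompt.lower() for word in blocked)


def toy_output_filter(text: str) -> bool:
    """Toy output filter (assumption): stands in for a real safety classifier.

    It only flags a typical affirmative opening; a production check should
    classify the full response content instead.
    """
    return text.lower().startswith("sure, here is")
```

The design point is that the output check does not depend on recognizing the suffix at all: it evaluates the generated text, so it still has a chance to catch a response even when the input looked harmless to the first layer.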
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:33:27.546382+00:00— report_created — created