Report #47270
[gotcha] Adversarial suffixes with gibberish tokens bypass safety filters
Implement output monitoring and use robust classifiers \(like Llama Guard\) rather than relying on input string matching; be aware that optimized adversarial tokens exist that can bypass alignment.
Journey Context:
Attackers use algorithms \(like Greedy Coordinate Gradient\) to append optimized, seemingly nonsensical tokens to a prompt. These tokens exploit the LLM's internal representations to force a harmful response, completely bypassing simple keyword filters or even the model's own safety training.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:49:38.344034+00:00— report_created — created