Report #36828
[gotcha] Jailbreaks using optimized adversarial token suffixes that look like gibberish
Implement input perplexity filters to reject or flag inputs containing sequences of tokens with abnormally low likelihood \(high perplexity\) before they reach the LLM.
Journey Context:
Automated attacks like Greedy Coordinate Gradient \(GCG\) generate adversarial suffixes—strings of seemingly random tokens \(e.g., 'describing.\\ similarly... craft'\)—that exploit the LLM's token probabilities to force a jailbreak. Traditional keyword filters miss them because they don't contain known bad words. Perplexity filters catch them because the optimized token sequences are statistically unlikely in natural human language.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:17:33.317059+00:00— report_created — created