Report #45790
[gotcha] Nonsensical adversarial suffixes bypass alignment and safety training
Implement input perplexity checks or use dedicated adversarial suffix detection models to reject prompts containing high-entropy, nonsensical character sequences before they reach the LLM.
Journey Context:
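Attackers use algorithms such as Greedy Coordinate Gradient (GCG) to append seemingly random strings of tokens (e.g., 'describing.\ similarly... craft') to a malicious prompt. These suffixes are optimized against the model's token probabilities to push it toward an affirmative response, which can bypass RLHF safety training. Because the suffixes look like gibberish, they tend to score a much higher perplexity than natural-language text, so measuring the perplexity of the input can detect them. Treat perplexity well above a threshold calibrated on benign traffic as a strong signal of an adversarial suffix, and block or flag the prompt. Legitimate inputs such as code, rare languages, or encoded data can also score high, so expect some false positives.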
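A minimal sketch of the perplexity check described above, under stated assumptions. A production filter would compute perplexity from a real language model's token log-likelihoods; here a toy character-bigram model with add-one smoothing stands in so the example runs with the standard library only. All names (CharBigramModel, REFERENCE_TEXT, max_window_perplexity, is_suspicious), the window size, and the threshold are hypothetical and chosen for illustration. The sliding window is an added design choice, not something this report prescribes: a short suffix can be diluted when perplexity is averaged over a long, otherwise natural prompt.

```python
import math
from collections import Counter

UNK = "\x00"  # placeholder for characters never seen in the reference text

# Hypothetical benign reference text; a real deployment would use far more data.
REFERENCE_TEXT = (
    "please summarize this article for me. "
    "write a short poem about the sea and the sky. "
    "explain how a binary search tree works. "
    "what is the capital of france and why is it famous. "
)


class CharBigramModel:
    """Toy character-bigram model with add-one smoothing (stand-in for a real LM)."""

    def __init__(self, reference_text):
        self.vocab = set(reference_text) | {UNK}
        self.pair_counts = Counter(zip(reference_text, reference_text[1:]))
        self.context_counts = Counter(reference_text[:-1])

    def perplexity(self, text):
        """exp of the mean negative log-probability per character transition."""
        chars = [c if c in self.vocab else UNK for c in text]
        if len(chars) < 2:
            return 1.0  # nothing to score
        total_log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.pair_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + len(self.vocab)
            total_log_prob += math.log(numerator / denominator)
        return math.exp(-total_log_prob / (len(chars) - 1))


def max_window_perplexity(model, text, window=16):
    """Highest perplexity over all sliding windows of `window` characters."""
    if len(text) <= window:
        return model.perplexity(text)
    return max(
        model.perplexity(text[i:i + window])
        for i in range(len(text) - window + 1)
    )


def is_suspicious(model, prompt, threshold, window=16):
    """Flag the prompt if any window exceeds the calibrated threshold."""
    return max_window_perplexity(model, prompt, window) > threshold
```

To use it, build the model once from benign text, pick a threshold from the perplexity distribution of known-good prompts, and call is_suspicious(model, prompt, threshold) before forwarding the prompt to the LLM. The threshold and window size need tuning against real traffic; the values here are placeholders.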
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:19:59.430333+00:00— report_created — created