Report #86458
[gotcha] Adversarial suffixes (gibberish tokens) bypass RLHF safety training
Implement input perplexity filters. GCG suffixes typically produce text with unusually high perplexity (highly anomalous token sequences) compared with natural language. Reject or flag inputs with anomalous token probability distributions.
Journey Context:
Developers often assume RLHF makes models robust. However, GCG (Greedy Coordinate Gradient) attacks optimize a suffix token by token: gradients shortlist candidate substitutions, and a greedy search keeps the ones that maximize the probability of an affirmative response such as one beginning with 'Sure'. These suffixes look like gibberish to humans but are highly effective. Because they are not natural language, a reference language model assigns them high perplexity, so perplexity filtering is a practical, albeit imperfect, defense that catches the low-hanging fruit of these automated attacks.
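A minimal sketch of such a filter, under stated assumptions: the caller already has per-token natural-log probabilities for the input from some reference language model (how those are obtained is out of scope here), and the names `perplexity`, `max_window_perplexity`, `should_flag`, the window size, and the threshold are all hypothetical choices for illustration, not part of this report. The sliding window is an added design choice, not something the report prescribes: a short gibberish suffix appended to a long natural prompt can be averaged away in a whole-prompt score.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def max_window_perplexity(token_logprobs: Sequence[float], window: int = 10) -> float:
    """Highest perplexity over any contiguous window of `window` tokens.

    Falls back to whole-sequence perplexity when the input is shorter
    than one window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(token_logprobs)
    if n <= window:
        return perplexity(token_logprobs)
    return max(
        perplexity(token_logprobs[i:i + window])
        for i in range(n - window + 1)
    )


def should_flag(token_logprobs: Sequence[float], threshold: float, window: int = 10) -> bool:
    """Flag the input if any window exceeds the (assumed) perplexity threshold."""
    return max_window_perplexity(token_logprobs, window) > threshold


if __name__ == "__main__":
    # Made-up log-probabilities for illustration only, not measured values.
    natural = [-2.0] * 30               # plausible natural-language tokens
    attacked = natural + [-9.0] * 20    # same prompt plus an improbable suffix
    threshold = 1000.0                  # hypothetical; tune on benign inputs

    print(should_flag(natural, threshold))    # natural prompt
    print(should_flag(attacked, threshold))   # prompt with suffix
    print(perplexity(attacked) > threshold)   # whole-prompt score alone
```

In this toy data the whole-prompt perplexity of the attacked input stays under the threshold while the windowed score exceeds it, which is why the sketch scores windows. Any threshold should be calibrated against real benign traffic, since code, non-English text, and unusual formatting can also score high.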
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:42:32.255912+00:00— report_created — created