Report #67754
[gotcha] GCG Suffix Attacks Bypassing Keyword Filters
Monitor user inputs for anomalous token sequences (high perplexity). Automated attacks often generate seemingly random suffixes that are optimized to jailbreak the model and do not read as natural language.
Journey Context:
Developers typically filter for obvious injection phrases such as 'ignore instructions'. Greedy Coordinate Gradient (GCG) attacks instead append an optimized suffix of seemingly random characters that shifts the model's logits toward a harmful response. These suffixes contain no trigger keywords, so keyword filters miss them, but they are statistically anomalous: an ordinary language model assigns them very high perplexity.
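A minimal sketch of the perplexity check described above, using only the Python standard library. Everything named here is an assumption for illustration: `CharBigramModel`, `max_window_perplexity`, `is_suspicious`, and the toy `REFERENCE_TEXT` are hypothetical, and the character bigram model is a small stand-in for a real language model. The sliding window is a design choice so that a short suffix is not averaged away by a long natural-language prompt.

```python
import math
from collections import Counter

# Toy reference corpus standing in for "normal" user prompts (assumption:
# a real deployment would score inputs with a proper language model).
REFERENCE_TEXT = (
    "please summarize the following article in three sentences. "
    "can you explain how this function works and why it returns an error? "
    "write a short email to my manager about the project deadline. "
    "what is the difference between a list and a tuple in python? "
    "translate this paragraph into french and keep the tone formal. "
    "give me a recipe for dinner that uses chicken and rice. "
    "how do i reset my password if i no longer have access to my email? "
)


class CharBigramModel:
    """Character bigram model with add-alpha smoothing (hypothetical helper)."""

    def __init__(self, text, alpha=0.1):
        text = text.lower()
        self.alpha = alpha
        # +1 reserves a single bucket for characters never seen in training.
        self.vocab_size = len(set(text)) + 1
        self.pair_counts = Counter(zip(text, text[1:]))
        self.context_counts = Counter(text[:-1])

    def perplexity(self, text):
        """exp of the mean negative log-probability per character transition."""
        text = text.lower()
        if len(text) < 2:
            return 1.0
        total = 0.0
        for prev, cur in zip(text, text[1:]):
            num = self.pair_counts[(prev, cur)] + self.alpha
            den = self.context_counts[prev] + self.alpha * self.vocab_size
            total += math.log(num / den)
        return math.exp(-total / (len(text) - 1))


def max_window_perplexity(model, text, window=16):
    """Highest perplexity over all sliding windows of `window` characters."""
    last_start = max(0, len(text) - window)
    return max(
        model.perplexity(text[start:start + window])
        for start in range(last_start + 1)
    )


def is_suspicious(model, text, threshold):
    """Flag the input when any window scores above the caller's threshold."""
    return max_window_perplexity(model, text) > threshold


if __name__ == "__main__":
    model = CharBigramModel(REFERENCE_TEXT)
    benign = "write a short email to my manager about the project deadline."
    # Made-up string in the style of an optimized suffix, not a real attack.
    suffix = " ]( **ONE ;) }{ !--Two @@ ~~ ##[[ %% ^^ ||"
    for label, prompt in [("benign", benign), ("with suffix", benign + suffix)]:
        print(label, round(max_window_perplexity(model, prompt), 1))
```

The threshold is deliberately left to the caller: it would need to be calibrated on real benign traffic, since unusual but legitimate inputs (code, URLs, other languages) can also score high. A high score is a signal for review or secondary checks, not proof of an attack.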
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:12:21.436586+00:00— report_created — created