Report #98091
[gotcha] Adversarial suffix attacks (GCG/AutoDAN) automatically find jailbreak strings
Do not rely on system prompts or blocklists to stop optimized attacks. Combine ensemble safety classifiers, input and output moderation, and adversarial training, and monitor incoming prompts for anomalous perplexity or token patterns.
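A minimal sketch of the perplexity monitoring idea, assuming a toy character-bigram model stands in for a real language model. The names train_bigram, perplexity, and looks_anomalous, the 40-character window, and the 2.0 ratio are all hypothetical choices for illustration, not tuned values. It compares how surprising the tail of a prompt is relative to the text before it, since optimized suffixes are appended at the end.

```python
import math
from collections import Counter


def train_bigram(corpus: str):
    """Count character bigrams in a reference corpus (toy stand-in for an LM)."""
    pairs = Counter(zip(corpus, corpus[1:]))   # (a, b) -> count
    firsts = Counter(corpus[:-1])              # a -> times a starts a bigram
    vocab_size = len(set(corpus)) + 1          # +1 bucket for unseen characters
    return pairs, firsts, vocab_size


def perplexity(text: str, model) -> float:
    """Per-character perplexity under the bigram model, add-one smoothed."""
    pairs, firsts, vocab_size = model
    if len(text) < 2:
        return 1.0
    log_sum = 0.0
    for a, b in zip(text, text[1:]):
        prob = (pairs[(a, b)] + 1) / (firsts[a] + vocab_size)
        log_sum += math.log(prob)
    return math.exp(-log_sum / (len(text) - 1))


def looks_anomalous(prompt: str, model, window: int = 40, ratio: float = 2.0) -> bool:
    """Flag prompts whose last `window` characters are far more surprising
    than the text that precedes them. Thresholds here are assumptions."""
    if len(prompt) <= 2 * window:
        return False  # too short to compare head and tail meaningfully
    head, tail = prompt[:-window], prompt[-window:]
    return perplexity(tail, model) > ratio * perplexity(head, model)
```

In practice the scoring model would likely be a real language model rather than character bigrams, and the threshold would need calibration on benign traffic. A signal like this is one layer only: attacks that produce fluent, readable text may not raise perplexity at all.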
Journey Context:
Attackers use gradient-based optimization to append seemingly random tokens that force aligned models to produce harmful completions. These suffixes can transfer across models, so a string found against one model may also work against another. A stronger system prompt is not the answer; the answer is defense in depth that combines model-level safety training with runtime moderation.
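The runtime moderation layer described above can be sketched as a wrapper that screens both the prompt and the completion. This is an illustrative outline under stated assumptions: guarded_generate, any_flags, and REFUSAL are hypothetical names, and each check is assumed to be a callable (for example a safety classifier or an anomaly detector) that returns True when text should be blocked.

```python
from typing import Callable, Iterable

# A check returns True when the text should be blocked (hypothetical interface).
Check = Callable[[str], bool]

REFUSAL = "Request declined by safety policy."


def any_flags(text: str, checks: Iterable[Check]) -> bool:
    """Ensemble rule: block if any single check flags the text."""
    return any(check(text) for check in checks)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_checks: Iterable[Check],
    output_checks: Iterable[Check],
) -> str:
    """Screen the prompt, call the model, then screen the completion."""
    if any_flags(prompt, input_checks):
        return REFUSAL  # stop before the model ever sees the prompt
    completion = generate(prompt)
    if any_flags(completion, output_checks):
        return REFUSAL  # catch harmful output even if the input looked clean
    return completion
```

Checking the output separately matters because an optimized suffix may slip past every input check; the output checks give a second chance to stop the harmful completion. The "any check flags" rule favors caution, and a deployment might instead weight or vote across classifiers to control false positives.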
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:13:19.397415+00:00— report_created — created