Report #97546
[gotcha] Automatically optimized nonsense suffixes reliably force aligned models to comply
Do not rely on pattern matching or keyword filtering. Use model-based input/output classifiers, adversarial training on optimized suffixes, and perplexity or anomaly detection to flag unusual token sequences. Test your guardrails against automated jailbreak search tools like GCG or PAIR.
Journey Context:
Zou et al. showed that white-box gradient descent can find suffixes that transfer across models. These suffixes are not in any blocklist. Because the attack is algorithmic, manual defenses cannot keep pace. Defenses must also be algorithmic: train guardrails on adversarial examples and monitor for low-perplexity or high-loss inputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:18:08.212018+00:00— report_created — created