Report #100896
[gotcha] Automatically optimized gibberish suffixes appended to a query reliably override RLHF safety training and transfer across models
Treat model alignment as a speed bump, not a security boundary. Layer input and output classifiers, monitor for low-perplexity adversarial suffix patterns, and use robustly aligned models or instruction-hierarchy training. Run continuous red teaming with automated optimizers, not just manual jailbreak templates.
Journey Context:
RLHF makes models refuse harmful requests in normal conversation, but Zou et al. showed that gradient-based search finds short suffixes that make aligned models comply. The suffixes are often ungrammatical but transfer from small open-source models to black-box commercial APIs. This means safety cannot be entrusted solely to the base model; it must be enforced at the system level with classifiers and output moderation. The attack also shows that alignment creates a brittle refusal direction that can be suppressed.
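One concrete form of the perplexity monitoring recommended above is a windowed perplexity filter (the baseline defense studied by Jain et al. after the Zou et al. paper): score short windows of the prompt under a language model of ordinary prose and flag any window that the model finds highly unlikely, since optimizer-generated gibberish is exactly the text such a model assigns low probability to. The block below is an added example, not code from this report or from either paper. To stay self-contained it fits a toy character-bigram model on a constructed corpus of benign prompts in place of a real LM, the threshold of 20 is chosen for that toy model and must be calibrated on real traffic, and the junk suffix in the test input is constructed filler, not an actual adversarial string.

```python
import math
import re
from collections import defaultdict

# Constructed reference corpus of ordinary assistant prompts (not from the report).
# A deployment would fit this on its own benign traffic, or score with a small
# language model's token log-probabilities instead of a character bigram model.
REFERENCE_PROMPTS = """
please write a short summary of the article in two paragraphs.
explain the main argument of the essay and list the key points and arguments.
can you help me draft an email to my manager about the project timeline.
translate this paragraph into plain english and keep the tone polite.
what are the most common causes of slow database queries.
give me three ideas for a birthday gift for a friend who likes hiking.
summarize the meeting notes below and highlight any action items.
write a python function that reads a file and counts the words in it.
how do i prepare for a job interview in the software industry.
describe the main differences between these two approaches in simple terms.
suggest a short title for a blog post about remote work.
check this text for spelling and grammar mistakes and explain each fix.
"""

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
SMOOTHING = 0.5  # add-k smoothing: unseen bigrams get a small non-zero probability
THRESHOLD = 20.0  # toy-model value; calibrate on held-out benign prompts in practice


def normalize(text):
    """Lowercase and collapse everything outside a-z into single spaces."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def train_bigram_model(corpus):
    """Count character bigrams and their left contexts over the corpus."""
    bigrams = defaultdict(int)
    contexts = defaultdict(int)
    text = normalize(corpus)
    for prev, cur in zip(text, text[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1
    return bigrams, contexts


def perplexity(model, text):
    """Per-character perplexity of text under the smoothed bigram model."""
    bigrams, contexts = model
    text = normalize(text)
    if len(text) < 2:
        return 1.0
    nll = 0.0
    for prev, cur in zip(text, text[1:]):
        p = (bigrams[(prev, cur)] + SMOOTHING) / (
            contexts[prev] + SMOOTHING * len(ALPHABET))
        nll -= math.log(p)
    return math.exp(nll / (len(text) - 1))


def windowed_perplexity(model, prompt, window=4):
    """Maximum perplexity over sliding windows of `window` words.

    Scoring the whole prompt at once lets a long fluent prefix average away a
    short gibberish tail; the maximum over windows does not."""
    words = normalize(prompt).split()
    if not words:
        return 1.0
    if len(words) <= window:
        return perplexity(model, " ".join(words))
    return max(perplexity(model, " ".join(words[i:i + window]))
               for i in range(len(words) - window + 1))


def is_suspicious(model, prompt, threshold=THRESHOLD):
    """True when some window of the prompt is implausible as ordinary prose."""
    return windowed_perplexity(model, prompt) > threshold


MODEL = train_bigram_model(REFERENCE_PROMPTS)

# Constructed inputs: a plain request, and the same request with random-looking
# filler appended. The filler is not a real adversarial suffix; it stands in for
# the kind of segment the filter is meant to surface for review.
BENIGN_PROMPT = "Summarize the main arguments of this article in two short paragraphs please."
SUFFIXED_PROMPT = BENIGN_PROMPT + " zxq vqkz jjx wqpf kqz xvvj zzq qxw"

if __name__ == "__main__":
    for label, prompt in (("benign", BENIGN_PROMPT), ("suffixed", SUFFIXED_PROMPT)):
        print(f"{label}: max window perplexity={windowed_perplexity(MODEL, prompt):.1f} "
              f"flagged={is_suspicious(MODEL, prompt)}")
```

The filter is one input-side classifier in the layered design the summary calls for, not a replacement for output moderation: it has a known false-positive cost on code, identifiers and non-English text, and an optimizer can be given a fluency penalty so its suffixes score closer to natural prose. Treat a hit as a signal to route the request to stricter handling and to log it for red-team review rather than as a verdict.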
Lifecycle
2026-07-02T05:16:48.991325+00:00 — report_created — created