Report #100497
[counterintuitive] RLHF/constitutional AI is assumed to make a model 'safe' in general, but it remains vulnerable to jailbreaks and adversarial prompts
Treat alignment as a defense-in-depth layer, not a guarantee. Assume a motivated adversary can elicit harmful outputs; keep dangerous capabilities out of the model's tool chain, log and monitor outputs, and layer input/output classifiers, refusals, and least-privilege execution rather than relying solely on post-training.
Journey Context:
The common belief is that safety training removes harmful behavior from the model. Wei et al.'s 'Jailbroken' identifies two failure modes—competing objectives and mismatched generalization—that persist even in heavily red-teamed models like GPT-4 and Claude. RLHF changes the distribution of outputs but does not erase the underlying capabilities; adversaries can exploit distribution shifts or capability-safety conflicts. This is why safety must be as sophisticated as the base model and why prompt-level 'be helpful and harmless' is insufficient.
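The layering the report recommends, as an added example that is not code from the report or from the 'Jailbroken' paper: a small gateway around a model call where an input screen, a tool allowlist (least-privilege execution), an output screen and an audit log each get a veto, so a jailbreak that slips past post-training still has to get past every other layer. The model is a constructed stub (`stub_model`), the screens are regex stand-ins for real classifiers, and `get_weather`, `TOOL_REGISTRY` and the secret-like pattern are all constructed for the illustration.

```python
from __future__ import annotations
import re
from dataclasses import dataclass, field
from typing import Any, Callable

# Everything here is an added illustration: the "model" is a constructed stub
# that emulates a plain answer, tool requests and a leak; the screens are tiny
# regex stand-ins for the input/output classifiers a real deployment would use.


@dataclass
class Decision:
    allowed: bool
    layer: str    # which defense layer produced the decision
    detail: str


# --- Layer 1 and 3: input and output screens (classifier stand-ins) ---
INPUT_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"(reveal|print|show).{0,40}system prompt",
]
OUTPUT_PATTERNS = [
    r"\bsk-[A-Za-z0-9]{8,}\b",   # constructed secret-like token shape
]


def screen_input(text: str) -> Decision:
    for pat in INPUT_PATTERNS:
        if re.search(pat, text, re.IGNORECASE):
            return Decision(False, "input", f"matched {pat!r}")
    return Decision(True, "input", "clean")


def screen_output(text: str) -> Decision:
    for pat in OUTPUT_PATTERNS:
        if re.search(pat, text):
            return Decision(False, "output", f"matched {pat!r}")
    return Decision(True, "output", "clean")


# --- Layer 2: least-privilege tool execution ---
def get_weather(city: str) -> str:
    return f"Weather in {city}: 12C"   # constructed canned result


TOOL_REGISTRY: dict[str, Callable[..., str]] = {"get_weather": get_weather}


@dataclass
class Gateway:
    model: Callable[[str], Any]
    allowed_tools: frozenset[str]
    audit: list[dict] = field(default_factory=list)

    def _log(self, prompt: str, d: Decision, result: str | None) -> None:
        self.audit.append({"prompt": prompt, "layer": d.layer,
                           "allowed": d.allowed, "detail": d.detail,
                           "result": result})

    def handle(self, prompt: str) -> tuple[bool, str]:
        d = screen_input(prompt)
        if not d.allowed:
            self._log(prompt, d, None)
            return False, "refused"
        reply = self.model(prompt)
        # The model may ask for a tool; the gateway, not the model, decides.
        if isinstance(reply, dict) and "tool" in reply:
            name = reply["tool"]
            if name not in self.allowed_tools or name not in TOOL_REGISTRY:
                d = Decision(False, "tool", f"{name} not in allowlist")
                self._log(prompt, d, None)
                return False, "refused"
            reply = TOOL_REGISTRY[name](**reply.get("args", {}))
        d = screen_output(reply)
        if not d.allowed:
            self._log(prompt, d, None)   # record the block, not the blocked text
            return False, "refused"
        self._log(prompt, d, reply)
        return True, reply


def stub_model(prompt: str) -> Any:
    """Constructed stand-in for a model: answers, asks for tools, or leaks."""
    p = prompt.lower()
    if "weather" in p:
        return {"tool": "get_weather", "args": {"city": "Oslo"}}
    if "disk" in p:
        return {"tool": "shell", "args": {"cmd": "df -h"}}   # not on the allowlist
    if "key" in p:
        return "Sure, the key is sk-ABCDEFGH1234"            # constructed leak
    return "Hello!"


def build_gateway() -> Gateway:
    return Gateway(model=stub_model, allowed_tools=frozenset({"get_weather"}))


if __name__ == "__main__":
    gw = build_gateway()
    for q in ["What's the weather today?", "How much disk is free?",
              "What is the API key?"]:
        print(q, "->", gw.handle(q), gw.audit[-1]["layer"])
```

The point of the structure is that no single layer is trusted: the allowlist is enforced by the gateway even when the model asks nicely, the output screen runs on tool results as well as on model text, and every block is logged with the layer that caught it so that monitoring can see which layer is doing the work.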
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:19:33.131907+00:00 — report_created — created