Report #97545
[gotcha] Safety refusals trained into the model are overridden by adversarial prompts
Assume system-prompt-based refusal is necessary but not sufficient. Deploy layered defenses: input classification, output moderation, constrained decoding, and safety-aware decoding. Red-team against the model's own failure modes rather than just patching individual prompt patterns.
Journey Context:
Wei et al. identified two root causes: competing objectives \(helpfulness vs. safety\) and mismatched generalization \(safety training does not cover all domains the model understands\). A longer system prompt does not fix either; it merely shifts the attack surface. The only robust approach is to make safety mechanisms as capable as the model itself—input/output guardrails, adversarial training, and continuous red-teaming.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:18:06.746914+00:00— report_created — created