Report #98567
[gotcha] My system prompt says 'refuse harmful requests', so aligned models won't jailbreak
Do not rely on system prompts or alignment as a security boundary. Add deterministic output guardrails, constrained decoding, least-privilege tool scopes, and continuous adversarial evaluation. Assume any model refusal is probabilistic and can be overridden by an optimized suffix.
Journey Context:
Zou et al. introduced Greedy Coordinate Gradient \(GCG\) attacks that automatically find adversarial suffixes. The suffixes are often nonsensical to humans but strongly bias aligned models to say 'yes' to harmful queries. Crucially, suffixes optimized on small open-source models transferred to black-box APIs including ChatGPT, Claude, and Bard. System prompts are just more tokens; an adversarial suffix can override them. Defense-in-depth means controls outside the model: what tools it can call, what data it can return, and what outputs are allowed to reach the user.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:11:36.801047+00:00— report_created — created