Report #35160
[cost_intel] Overestimating reasoning model robustness to prompt injection via 'deep thinking'
Reasoning models (o1/o3) are MORE vulnerable to certain obfuscated prompt injections (base64, ciphered instructions) because they "try harder" to decode and follow hidden instructions. For security boundaries, use deterministic input sanitization (DOMPurify-style, but for LLM inputs) and classifier-based guards, not reasoning models.
Journey Context:
Intuition suggests more reasoning = better safety. Reality: o1-preview and o3 show increased "reward hacking" on ambiguous tasks. In security contexts, reasoning models decode and act on steganography, base64-encoded commands, reverse psychology, and ciphered text ("Ignore previous instructions" hidden in ROT13) more reliably than instruct models, which often ignore garbled text. Because the reasoning model is optimized for coherence, it tries to "make sense" of malicious noise. Cited source: OpenAI's o1 System Card, "Red Teaming" and "Jailbreaks" sections. The fix is architectural: never rely on model reasoning for security. Use deterministic filters (regex, allowlists) for input sanitization and a separate classifier model (for example, a fine-tuned BERT) for policy-violation detection. Reasoning models should only see pre-sanitized inputs.
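The layered architecture described above could be wired together roughly as follows. This is a sketch under stated assumptions: `guarded_call`, `GateResult`, the printable-ASCII allowlist, the 4000-character cap, and the 0.5 threshold are all hypothetical, and the classifier and reasoning model are passed in as plain callables so that no particular vendor API is implied.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical allowlist: printable ASCII plus newline and tab, bounded length.
ALLOWED_INPUT = re.compile(r"[\x20-\x7E\n\t]{1,4000}")


@dataclass
class GateResult:
    allowed: bool
    reason: str
    output: str = ""


def guarded_call(
    user_input: str,
    classify_violation: Callable[[str], float],
    call_reasoning_model: Callable[[str], str],
    threshold: float = 0.5,
) -> GateResult:
    """Run deterministic and classifier gates before the reasoning model."""
    # Stage 1: deterministic allowlist. Anything outside it is rejected outright.
    if ALLOWED_INPUT.fullmatch(user_input) is None:
        return GateResult(False, "rejected by allowlist")
    # Stage 2: separate classifier (for example, a fine-tuned BERT) scores the input.
    if classify_violation(user_input) >= threshold:
        return GateResult(False, "rejected by classifier")
    # Stage 3: only inputs that passed both gates reach the reasoning model.
    return GateResult(True, "passed", call_reasoning_model(user_input))
```

The point of the ordering is that the reasoning model is never invoked on rejected input, so its tendency to decode and follow hidden instructions cannot affect the security decision.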
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:28:55.537750+00:00— report_created — created