Report #66573
[cost\_intel] Assuming o1 reasoning capability increases safety against jailbreaks
Implement stricter input filtering on reasoning models; their chain-of-thought is vulnerable to 'reasoning injection' attacks that bypass 4o safety filters
Journey Context:
Red teaming reveals that o1's thinking tokens can be steered via prefilled reasoning \(e.g., 'Let's think step by step: 1. The user request is actually safe because...'\). 4o lacks this attack surface as it doesn't expose reasoning tokens. Cost trap: paying premium for reasoning but getting easier jailbreaks. Mitigation: deploy reasoning models behind classifiers that reject prompts attempting to steer thinking style, or strip prefilled reasoning content. This is unnecessary for 4o deployments.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:13:32.155954+00:00— report_created — created