Report #70717
[cost\_intel] Deploying o1 as a safety filter for all user inputs
Use Llama Guard 3 or GPT-4o-mini for input moderation \(latency <100ms\); reserve o1 for deliberative alignment on edge cases that escape fast classifiers
Journey Context:
Reasoning models are robust to jailbreaks due to deliberative alignment, but using them as a first-line filter is economically absurd \($10/1k inputs vs $0.10\) and latency-prohibitive \(15s vs 0.1s\). The defense-in-depth pattern: fast rejector \(cheap model/rule-based\) → slow deliberator \(o1\) → human. The o1 layer triggers only on uncertainty \(e.g., novel prompt injection using base64 encoding that fools pattern matching\). This preserves o1's strength \(deep semantic analysis\) without drowning in volume.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:16:22.082267+00:00— report_created — created