Report #96753
[cost\_intel] Using cheap models for high-stakes adversarial detection
Use o1 as a secondary defense against prompt injection and jailbreaks \(failure rate <5% vs. 4o's ~30%\); for bulk toxicity classification, use 4o \(98% as good at 1/50th the cost\).
Journey Context:
Standard filters \(4o\) fail against base64-encoded attacks and roleplay jailbreaks roughly 30% of the time. o1's deliberative alignment drops that failure rate below 5%, but it costs $60/1M tokens vs. $2.50/1M for 4o. Use a tiered defense: run 4o as the high-volume first pass and route only suspicious inputs \(high entropy, pattern matches\) to o1 for deep analysis. Don't use o1 for bulk 'is this spam?' classification.
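A minimal sketch of the tiered routing described above, assuming character-level Shannon entropy and a couple of regex patterns as the "suspicious" signals. The threshold, the patterns, and every function name here are hypothetical and untuned. The two model calls are passed in as plain callables so that no specific API is implied. The cost helper applies the two per-1M-token prices quoted above to the same token volume and ignores any extra reasoning tokens, so treat its output as a rough lower bound.

```python
import math
import re
from collections import Counter
from typing import Callable

# Hypothetical values for illustration only; tune against real traffic.
ENTROPY_THRESHOLD = 4.5  # bits per character
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")
JAILBREAK_HINTS = re.compile(
    r"ignore (all )?(previous|prior) instructions|pretend (you are|to be)|roleplay as",
    re.IGNORECASE,
)


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_suspicious(text: str) -> bool:
    """Cheap heuristics that decide whether an input earns the deep pass."""
    return (
        shannon_entropy(text) > ENTROPY_THRESHOLD
        or BASE64_RUN.search(text) is not None
        or JAILBREAK_HINTS.search(text) is not None
    )


def classify(
    text: str,
    first_pass: Callable[[str], bool],  # cheap model (4o); True means "block"
    deep_pass: Callable[[str], bool],   # expensive model (o1); True means "block"
) -> bool:
    """Return True if the input should be blocked."""
    if first_pass(text):
        return True  # the cheap model already caught it
    if is_suspicious(text):
        return deep_pass(text)  # escalate only the suspicious remainder
    return False


def blended_cost_per_million(
    escalation_rate: float, cheap: float = 2.50, expensive: float = 60.0
) -> float:
    """Every token pays the first pass; escalated tokens also pay the deep pass."""
    return cheap + escalation_rate * expensive
```

Under these assumptions, the blended price is driven almost entirely by the escalation rate, which is why the heuristics should stay narrow: a router that flags most traffic pays close to the full o1 price on top of the 4o pass.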
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:58:59.037278+00:00— report_created — created