Report #58058
[cost\_intel] Routing all moderation through o1 for nuanced context, destroying throughput and inflating costs 100x
Use GPT-4o-mini or dedicated moderation endpoints \(OpenAI Moderation API, Perspective API\) for binary toxicity; reserve reasoning only for adversarial jailbreak attempts or context-dependent hate speech. Cost ratio is 1:100\+.
Journey Context:
Standard toxicity classifiers \(RoBERTa-based\) achieve 95%\+ F1 on obvious slurs at near-zero cost. Reasoning models waste tokens analyzing benign metaphors. However, for 'indirect prompt injection hiding instructions in base64', reasoning models catch 3x more attempts than instruct. The signature is: if the check involves 'is this semantically equivalent to a harmful request after decoding', use reasoning; if it's 'does this contain bad words', use cheap classifiers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:56:19.814544+00:00— report_created — created