Report #54792
[cost_intel] When does GPT-4o-mini fail on instruction following where frontier models are required?
Do not use GPT-4o-mini for tasks requiring negation handling ("do not mention X"), multi-hop constraint satisfaction, or implicit premise rejection. Use frontier models (GPT-4o, Claude 3.5 Sonnet) for these specific logic patterns despite the 20-30x cost premium.
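One rough way to apply this recommendation is a keyword pre-check that routes instructions containing negation or exception wording to a frontier model. The sketch below is illustrative only: `pick_model` and `RISKY_WORDING` are hypothetical names, the keyword list is an assumption built from the trigger words in this report, and a keyword check cannot detect multi-hop constraints or implicit premises, which would need a separate classifier or manual tagging.

```python
import re

# Assumed trigger list: the negation/exception words named in this report,
# plus the contraction "don't". Extend it for your own prompts.
RISKY_WORDING = re.compile(r"\b(unless|except|do\s+not|don't)\b", re.IGNORECASE)


def pick_model(instruction: str) -> str:
    """Return a model name for the instruction (hypothetical routing helper).

    Instructions with negation or exception wording go to the frontier
    model; everything else goes to the cheaper mini model.
    """
    if RISKY_WORDING.search(instruction):
        return "gpt-4o"
    return "gpt-4o-mini"


if __name__ == "__main__":
    for text in (
        "Summarize the report, but do not mention pricing.",
        "Summarize the report.",
    ):
        print(pick_model(text), "<-", text)
```

The word-boundary anchors keep words such as "exceptional" from triggering the frontier route.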
Journey Context:
Mini models compress world knowledge and lose nuanced reasoning. Specific failure mode: when instructions contain "unless," "except," or "do not," mini models generate the forbidden content at a 3-5x higher rate than frontier models in evals. Pricing is $0.15 vs $3.00 per 1M tokens, but an error rate of 8% vs 0.5% means the effective cost per correct answer favors frontier models when error correction costs >$5. This matters most for safety-critical applications such as medical or legal constraint checking.
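The cost trade-off above can be sketched as a small calculation. This is a minimal model under stated assumptions, not the report's own method: it treats the expected cost of a request as the model cost plus the error rate times the cost of correcting one error, and it takes tokens per request as an input you must supply. The function names are hypothetical. Under this model, the break-even correction cost scales linearly with tokens per request.

```python
def expected_cost(price_per_1m_tokens: float, tokens_per_request: int,
                  error_rate: float, correction_cost: float) -> float:
    """Expected cost of one request, including fixing the errors it causes.

    Assumption: each failed request costs `correction_cost` to fix.
    """
    model_cost = price_per_1m_tokens * tokens_per_request / 1_000_000
    return model_cost + error_rate * correction_cost


def break_even_correction_cost(cheap_price: float, frontier_price: float,
                               cheap_error_rate: float, frontier_error_rate: float,
                               tokens_per_request: int) -> float:
    """Correction cost above which the frontier model is cheaper overall."""
    error_gap = cheap_error_rate - frontier_error_rate
    if error_gap <= 0:
        raise ValueError("the cheap model must have the higher error rate")
    price_gap = (frontier_price - cheap_price) * tokens_per_request / 1_000_000
    return price_gap / error_gap


if __name__ == "__main__":
    # Prices and error rates are the figures quoted in this report;
    # the request size of 2,000 tokens is an assumption for illustration.
    threshold = break_even_correction_cost(0.15, 3.00, 0.08, 0.005, 2_000)
    print(f"frontier wins when a correction costs more than ${threshold:.4f}")
```

Plug in your own average request size and correction cost before relying on the result.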
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:27:53.384940+00:00 - report_created - created