Report #43208
[cost_intel] Which tasks genuinely require GPT-4/Claude-3.5-Sonnet and fail on GPT-4o-mini/Haiku?
Three categories require frontier models: (1) multi-hop reasoning across conflicting sources (e.g., 'Reconcile these two legal contracts and identify contradictions'), (2) novel algorithm generation with more than 3 steps of logic (e.g., 'Write a Python function that solves this specific graph coloring problem with these constraints'), and (3) high-stakes persuasive content where tone calibration is critical (e.g., 'Draft a board-level escalation email that is firm but not alienating'). On multi-hop tasks, GPT-4o-mini drops to 40% accuracy versus 85% for GPT-4. The frontier model costs 50x more, but its lower error rate on critical tasks justifies the price.
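The trade-off above can be made concrete with a small expected-cost calculation. This is a minimal sketch: the 40% and 85% accuracies and the 50x price ratio come from the report, while the absolute per-call prices and the function names are hypothetical placeholders chosen only for illustration.

```python
# Hypothetical per-call prices in USD; only the 50x ratio comes from the report.
MINI_CALL_COST = 0.001
FRONTIER_CALL_COST = MINI_CALL_COST * 50

# Multi-hop accuracies quoted in the report.
MINI_ACCURACY = 0.40
FRONTIER_ACCURACY = 0.85


def expected_cost(call_cost, accuracy, error_cost):
    """Price of the call plus the expected cost of a wrong answer."""
    return call_cost + (1.0 - accuracy) * error_cost


def break_even_error_cost(cheap_cost, cheap_acc, frontier_cost, frontier_acc):
    """Error cost above which the frontier model is cheaper in expectation."""
    if frontier_acc <= cheap_acc:
        raise ValueError("frontier model must be more accurate than the cheap one")
    return (frontier_cost - cheap_cost) / (frontier_acc - cheap_acc)


if __name__ == "__main__":
    threshold = break_even_error_cost(
        MINI_CALL_COST, MINI_ACCURACY, FRONTIER_CALL_COST, FRONTIER_ACCURACY
    )
    print(f"Frontier wins once a mistake costs more than ${threshold:.4f}")
    for error_cost in (0.01, 50.0):
        mini = expected_cost(MINI_CALL_COST, MINI_ACCURACY, error_cost)
        frontier = expected_cost(FRONTIER_CALL_COST, FRONTIER_ACCURACY, error_cost)
        print(f"error_cost=${error_cost}: mini={mini:.4f}, frontier={frontier:.4f}")
```

Under these assumed prices the break-even point is far below a dollar per mistake, which is why a 50x price gap can still be worth paying when the accuracy gap is this wide. With different prices or a narrower accuracy gap the threshold moves, so plug in your own numbers.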
Journey Context:
The common mistake is to use a 'smart model for everything' or a 'cheap model for everything.' The value of frontier models is not general knowledge (RAG covers that) but reasoning over latent variables. Take coding assistants: GPT-4o-mini handles boilerplate (90% of LOC) at 1/20th the cost, but when debugging a race condition that requires analyzing 3 stack traces and a git diff, Sonnet 3.5 is 4x more likely to identify the root cause. The cost signal: if a mistake costs more than $50 (customer churn, a production bug), use a frontier model; if the task is 'transform A to B' with deterministic validation, use mini. The irreplaceable signature: the task requires handling edge cases that were not in the training distribution (novel combinations).
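The routing rule described above could be sketched as follows. The $50 threshold and the three signals come from the report; the `Task` fields, the tier labels, and the `default` fallback for tasks that match none of the signals are hypothetical choices for illustration, not part of the original guidance.

```python
from dataclasses import dataclass

FRONTIER = "frontier"  # e.g. GPT-4 or Claude 3.5 Sonnet
MINI = "mini"          # e.g. GPT-4o-mini or Haiku

# Threshold from the report: a mistake costing more than $50 goes to frontier.
MISTAKE_COST_THRESHOLD_USD = 50.0


@dataclass
class Task:
    """Hypothetical task descriptor; field names are illustrative only."""
    description: str
    mistake_cost_usd: float
    deterministic_validation: bool
    novel_edge_cases: bool = False


def choose_tier(task, default=MINI):
    """Pick a model tier using the report's cost signals.

    The `default` for tasks matching no signal is an assumption;
    the report does not say how to handle that case.
    """
    if task.mistake_cost_usd > MISTAKE_COST_THRESHOLD_USD:
        return FRONTIER
    if task.novel_edge_cases:
        return FRONTIER
    if task.deterministic_validation:
        return MINI
    return default


if __name__ == "__main__":
    tasks = [
        Task("Convert JSON export to CSV", 1.0, True),
        Task("Debug race condition from 3 stack traces and a git diff", 500.0, False, True),
    ]
    for t in tasks:
        print(f"{t.description}: {choose_tier(t)}")
```

Checking the mistake cost first means an expensive failure always routes to the frontier tier, even when the output can be validated deterministically. That ordering is a design choice in this sketch and can be reversed if validation is trusted to catch every error.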
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:59:52.304633+00:00— report_created — created