Report #47025
[cost_intel] Are reasoning models cost-effective for simple factual Q&A?
Use GPT-4o or smaller models for SimpleQA-style factual recall. Reserve o1/o3 for multi-hop factual reasoning or fact-checking conflicting sources. On simple facts, the cost per correct answer is roughly 10x higher for reasoning models, in exchange for an accuracy gain of only about 5 percentage points.
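The routing rule above can be sketched as a small dispatcher. This is a minimal illustration, not a tested production router: the function name `pick_model`, the `num_sources` parameter, and the keyword cues are all hypothetical, and the model names are plain labels rather than calls to any API.

```python
import re

# Hypothetical tier labels; substitute whatever model identifiers you deploy.
CHEAP_MODEL = "gpt-4o"
REASONING_MODEL = "o1"

# Crude, assumed cues that a question needs reasoning over several facts
# or sources. A real router would likely use a classifier instead.
_REASONING_CUES = re.compile(
    r"\b(why|disagree|contradict\w*|conflict\w*|compare|reconcile|fact-check)\b",
    re.IGNORECASE,
)


def pick_model(question: str, num_sources: int = 0) -> str:
    """Return the model tier for a question.

    Routes to the reasoning model when the caller supplies two or more
    source documents, or when the question text contains a reasoning cue.
    Everything else is treated as simple recall and sent to the cheap model.
    """
    if num_sources >= 2 or _REASONING_CUES.search(question):
        return REASONING_MODEL
    return CHEAP_MODEL


print(pick_model("What year was the Eiffel Tower completed?"))
print(pick_model("Why do these two scientific papers disagree on X?"))
```

The keyword list will misroute some questions (a simple "why" question still goes to the reasoning tier), so treat it as a starting point to tune against your own traffic.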
Journey Context:
OpenAI's SimpleQA eval shows GPT-4o achieving ~40% accuracy on short-answer factual questions and o1 achieving ~45%, but o1 does so at 10-20x the cost and 30x the latency. The result is a poor cost-per-correct-answer ratio on simple facts that require no reasoning. However, when the question requires combining facts from multiple documents or resolving contradictions (e.g., 'Why do these two scientific papers disagree on X?'), the accuracy gap widens to >30%, which justifies the cost. Common mistake: routing all 'knowledge' queries to o1 on the assumption that higher intelligence means better fact retrieval, when simple lookup requires no reasoning.
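The cost-per-correct-answer comparison can be worked through directly from the figures quoted above (~40% vs ~45% accuracy, 10-20x cost). This is a sketch in relative units: the baseline cost of 1.0 per query is an assumption for illustration, not a real price, and `cost_per_correct` is a helper name introduced here.

```python
def cost_per_correct(cost_per_query: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_query / accuracy


# Relative units: the cheaper model is assumed to cost 1.0 per query.
baseline = cost_per_correct(cost_per_query=1.0, accuracy=0.40)

# Reasoning model at the low and high ends of the quoted 10-20x cost range.
reasoning_low = cost_per_correct(cost_per_query=10.0, accuracy=0.45)
reasoning_high = cost_per_correct(cost_per_query=20.0, accuracy=0.45)

ratio_low = reasoning_low / baseline
ratio_high = reasoning_high / baseline

print(f"{ratio_low:.1f}x to {ratio_high:.1f}x")  # prints "8.9x to 17.8x"
```

Under these figures the small accuracy gain barely offsets the price: the reasoning model costs roughly 9x to 18x more per correct answer, consistent with the "roughly 10x" summary.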
Lifecycle
2026-06-19T09:24:09.363052+00:00— report_created — created