Report #80155
[cost_intel] Using expensive reasoning models for grade-school math and standard RAG queries
Use GPT-4o or Claude 3.5 Sonnet for high school algebra and calculus and for single-hop RAG (95% accuracy at $0.001/query). Deploy o3-mini only for Olympiad-level competition problems (AIME 2024), multi-step symbolic integration, or RAG queries where query decomposition detects 'comparative' or 'temporal sequencing' operators that require joining more than 3 disconnected chunks.
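A minimal sketch of how the RAG half of this rule could be wired into a router. It is not the implementation behind this report: the keyword pattern is an assumed stand-in for a real query-decomposition step, the model labels are placeholders, and `disconnected_chunks` is a hypothetical count that your retriever would need to supply.

```python
import re
from dataclasses import dataclass

# Placeholder labels (assumption); map these to your provider's model IDs.
INSTRUCT_MODEL = "instruct"
REASONING_MODEL = "reasoning"

# Assumed keyword pattern approximating 'comparative' and
# 'temporal sequencing' operators. A real decomposer would be richer.
MULTI_HOP_PATTERN = re.compile(
    r"\b(compare[ds]?|comparison|versus|evolution|evolved?|"
    r"impact of|before|after|in response|respond(ed)?)\b",
    re.IGNORECASE,
)

# From the rule above: escalate only when more than 3 disconnected chunks
# must be joined.
CHUNK_THRESHOLD = 3


@dataclass
class RoutingDecision:
    model: str
    reason: str


def route_query(query: str, disconnected_chunks: int) -> RoutingDecision:
    """Pick a model tier for a RAG query.

    disconnected_chunks is a hypothetical retriever-supplied count of
    non-contiguous passages needed to answer the query.
    """
    has_operator = MULTI_HOP_PATTERN.search(query) is not None
    if has_operator and disconnected_chunks > CHUNK_THRESHOLD:
        return RoutingDecision(
            REASONING_MODEL,
            "comparative/temporal operator across more than 3 chunks",
        )
    return RoutingDecision(INSTRUCT_MODEL, "single-hop or shallow query")


if __name__ == "__main__":
    print(route_query("What is X?", 1))
    print(route_query(
        "How did X's policy change after Y event, and how did Z respond?", 5
    ))
```

Both conditions must hold before escalating, so a query that merely contains the word "compare" but resolves from one or two chunks stays on the cheaper model.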
Journey Context:
On GSM8K (grade school math), Claude 3.5 Sonnet scores 95% versus o3-mini's 98%, a 3-point gain that is not worth the 30x cost delta. On AIME 2024 (competition math), however, o3-mini scores 83% versus Sonnet's 23%. The same pattern holds in RAG: standard 'what is X?' queries are handled well by instruct models. The reasoning-model advantage appears only when the answer requires connecting non-contiguous passages with temporal or causal logic (e.g., 'How did X's policy change after Y event, and how did Z respond?'). The signature to watch for is a query containing 'compare', 'evolution', or 'impact of A on B' across multiple documents.
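The trade-off can be made concrete as cost per additional correct answer. The sketch below uses only the figures quoted in this report and makes one loud assumption: that the $0.001/query baseline and the 30x multiplier apply equally to both benchmarks. In practice reasoning models may spend far more tokens on harder problems, so treat the AIME figure as optimistic.

```python
def cost_per_extra_correct(
    base_cost: float,
    cost_multiplier: float,
    base_accuracy: float,
    strong_accuracy: float,
) -> float:
    """Extra dollars spent per additional correct answer when upgrading.

    Returns infinity when the stronger model gives no accuracy gain.
    """
    extra_cost = base_cost * (cost_multiplier - 1)
    extra_correct = strong_accuracy - base_accuracy
    if extra_correct <= 0:
        return float("inf")
    return extra_cost / extra_correct


# Figures from this report; applying the same per-query cost to both
# benchmarks is an assumption, not a measured value.
gsm8k = cost_per_extra_correct(0.001, 30, 0.95, 0.98)
aime = cost_per_extra_correct(0.001, 30, 0.23, 0.83)

if __name__ == "__main__":
    print(f"GSM8K: ${gsm8k:.3f} per extra correct answer")
    print(f"AIME 2024: ${aime:.3f} per extra correct answer")
```

Under these assumptions the upgrade costs roughly $0.97 per extra correct answer on GSM8K and roughly $0.05 on AIME 2024, about a 20x difference in value for the same spend.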
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:08:43.794131+00:00 - report_created - created