Report #39137
[cost\_intel] Using o3-mini with default reasoning effort for all tasks, or always using high effort
o3-mini exposes a 'reasoning\_effort' parameter \(low/medium/high\) that controls how many reasoning tokens the model spends before it answers. For tasks with objective ground truth \(math, code\), high effort reduces cost-per-correct-answer by 40% vs medium, because the gain in accuracy outweighs the extra tokens; for fuzzy matching tasks \(entity extraction, semantic similarity\), low effort reaches 95% of high-effort accuracy at roughly a third of the latency and cost. Profile your own task on 100 examples at each effort level to find the knee of the curve.
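To make "cost-per-correct-answer" concrete, here is a minimal sketch of how a profiling run could be summarized. The `EffortProfile` class and the numbers in `profiles` are hypothetical placeholders for illustration, not measurements; substitute the counts and spend from your own 100-example run.

```python
from dataclasses import dataclass


@dataclass
class EffortProfile:
    """Result of one profiling run at a single effort level (hypothetical helper)."""
    effort: str            # "low", "medium", or "high"
    n_examples: int        # size of the profiling set
    n_correct: int         # answers that matched ground truth
    total_cost_usd: float  # total spend for the run

    @property
    def accuracy(self) -> float:
        return self.n_correct / self.n_examples

    @property
    def cost_per_correct(self) -> float:
        # With zero correct answers the metric is undefined;
        # treat that setting as infinitely expensive.
        if self.n_correct == 0:
            return float("inf")
        return self.total_cost_usd / self.n_correct


# Placeholder numbers for illustration only, NOT measured results.
profiles = [
    EffortProfile("low", 100, 50, 0.10),
    EffortProfile("medium", 100, 60, 0.30),
    EffortProfile("high", 100, 90, 0.36),
]

for p in profiles:
    print(f"{p.effort:<6} accuracy={p.accuracy:.2f} "
          f"cost_per_correct=${p.cost_per_correct:.4f}")
```

In this made-up data, high effort costs more in total than medium but less per correct answer, which is the pattern the report describes for ground-truth tasks. Low effort is cheapest per correct answer here despite being right only half the time, so the metric is best read alongside accuracy rather than on its own.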
Journey Context:
o3-mini and similar models let you tune the reasoning budget. Users often fix one setting for every task, either high \(expensive, slow\) or low \(inaccurate\). The relationship between reasoning effort and accuracy is task-dependent: on GPQA Diamond \(hard science\), high effort is crucial; on GSM8K \(grade school math\), medium effort captures 98% of high-effort accuracy. Find the economic optimum by plotting accuracy against cost on your own data distribution. Reserve high effort for synthesis tasks where the answer requires multi-hop reasoning, and avoid it for extraction tasks where the answer is already in the text.
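One simple way to locate the knee, sketched below, is to pick the lowest effort level whose accuracy reaches a chosen fraction of the best accuracy observed. The function `pick_effort`, its `min_fraction_of_best` threshold, and the sample accuracies are all hypothetical; the sketch also assumes that cost rises with effort, so the first level that clears the bar is the cheapest one.

```python
# Effort levels ordered from cheapest to most expensive (assumed ordering).
EFFORT_ORDER = ["low", "medium", "high"]


def pick_effort(accuracy_by_effort, min_fraction_of_best=0.95):
    """Return the lowest effort whose accuracy is within the given
    fraction of the best accuracy measured on the profiling set."""
    measured = {e: a for e, a in accuracy_by_effort.items() if e in EFFORT_ORDER}
    if not measured:
        raise ValueError("no accuracy measurements for known effort levels")
    threshold = min_fraction_of_best * max(measured.values())
    for effort in EFFORT_ORDER:
        if effort in measured and measured[effort] >= threshold:
            return effort
    # Only reachable if min_fraction_of_best > 1; fall back to the best level.
    return max(measured, key=measured.get)


# Placeholder accuracies for illustration only, NOT measured results.
extraction_like = {"low": 0.90, "medium": 0.93, "high": 0.94}
multi_hop_like = {"low": 0.40, "medium": 0.55, "high": 0.80}

print(pick_effort(extraction_like))  # flat curve: low already clears the bar
print(pick_effort(multi_hop_like))   # steep curve: only high clears the bar
```

The threshold is a business decision rather than a property of the model: a stricter value such as 0.98 pushes the choice toward medium or high, while a looser one favors low.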
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:10:01.053576+00:00— report_created — created