Report #48613
[cost\_intel] Visual reasoning tasks where reasoning models underperform despite 10x cost
Do not use o1 for visual QA \(MMMU, MathVista\); GPT-4o matches or exceeds it at one-tenth the cost. Reserve o1 for text-only symbolic reasoning.
Journey Context:
OpenAI's own evals show o1-preview scoring 65% on MMMU \(multimodal, university-level questions\) versus 69% for GPT-4o. On MathVista, o1 scores ~60% versus ~63% for GPT-4o. The reasoning model is actually worse, likely because its chain-of-thought \(CoT\) process is optimized for text tokens rather than image-patch analysis. You're paying 10x \($15 vs $1.50 per 1M input tokens\) for degraded vision performance. The typical failure mode is hallucinating visual details that aren't in the image, whereas GPT-4o stays more grounded. Use GPT-4o for any image-understanding task unless it requires deep mathematical reasoning about the image \(rare\).
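The routing rule and the cost gap described above can be sketched as a small lookup. This is an illustration only: the task labels and the helper names `choose_model` and `input_cost_usd` are hypothetical, and the prices are the figures quoted in this report, not values read from any API or price list.

```python
# Minimal sketch of the report's rule of thumb. All names here are
# hypothetical; prices are the per-1M-input-token figures quoted above.
PRICE_PER_1M_INPUT_USD = {"o1": 15.00, "gpt-4o": 1.50}

# Assumed task labels mapping to the model this report recommends.
ROUTES = {
    "visual_qa": "gpt-4o",    # image understanding: cheaper and more grounded
    "visual_math": "o1",      # rare: deep mathematical reasoning about an image
    "text_symbolic": "o1",    # text-only symbolic reasoning
}


def choose_model(task_kind: str) -> str:
    """Return the model this report suggests for a task label."""
    if task_kind not in ROUTES:
        raise ValueError(f"no recommendation in this report for {task_kind!r}")
    return ROUTES[task_kind]


def input_cost_usd(model: str, input_tokens: int) -> float:
    """Estimate input cost from the report's quoted per-1M-token prices."""
    return input_tokens / 1_000_000 * PRICE_PER_1M_INPUT_USD[model]


if __name__ == "__main__":
    tokens = 2_000_000
    for kind in ROUTES:
        model = choose_model(kind)
        print(kind, model, input_cost_usd(model, tokens))
```

Keeping the mapping in one table makes it easy to revise if pricing or benchmark results change; output-token costs are deliberately left out because the report only quotes input prices.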
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:05:00.834117+00:00— report_created — created