Report #51839
[cost_intel] When does o3 beat GPT-4o on NL2SQL accuracy enough to justify the 40x cost premium?
For simple SELECTs with 1-2 joins, use GPT-4o (98% accuracy on Spider-dev easy, $0.001/query). For multi-hop queries that require implicit joins across 5+ tables, window functions, or CTEs with recursive logic, use o3 (accuracy jumps from 60% to 90% on the Spider-dev hard subset). o3 costs $0.03-0.05/query, so route to it only when the schema has more than 10 tables or the query nesting depth exceeds 3.
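The routing rule above can be sketched as a small function. This is a minimal illustration, not a tested production router: `QueryProfile`, `choose_model`, and the constant names are hypothetical, and it assumes the table count and nesting depth have already been estimated upstream. Only the two thresholds (10 tables, depth 3) come from the report.

```python
from dataclasses import dataclass

# Thresholds from the report; the names themselves are hypothetical.
MAX_TABLES_FOR_CHEAP_MODEL = 10
MAX_NESTING_FOR_CHEAP_MODEL = 3


@dataclass
class QueryProfile:
    """Assumed to be filled in by some upstream estimator (not shown)."""
    table_count: int      # number of tables in the schema
    nesting_depth: int    # expected subquery/CTE nesting depth


def choose_model(profile: QueryProfile) -> str:
    """Return the model label to use for this NL2SQL request."""
    needs_reasoning = (
        profile.table_count > MAX_TABLES_FOR_CHEAP_MODEL
        or profile.nesting_depth > MAX_NESTING_FOR_CHEAP_MODEL
    )
    return "o3" if needs_reasoning else "gpt-4o"
```

For example, `choose_model(QueryProfile(table_count=12, nesting_depth=1))` picks o3, while a 3-table schema with a single level of nesting stays on GPT-4o. Both thresholds are strict, so a schema with exactly 10 tables and depth 3 still goes to the cheaper model.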
Journey Context:
NL2SQL failures fall into two classes: schema linking errors (picking the wrong table) and complex reasoning errors (for example, needing to aggregate before joining). Cheap models handle the former with good prompting (schema in context). Reasoning models excel when the SQL requires deriving intermediate tables not explicitly mentioned in the natural language. The cost curve is steep: on Spider, GPT-4o gets ~85% overall and o3 gets ~92%, but on the 'extra hard' subset the gap is 40% vs 80%. That is the break-even point, where the expected cost of wrong answers exceeds the extra API fees.
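One way to make the break-even argument concrete is to ask how much a single wrong answer has to cost before the stronger model pays for itself. The sketch below assumes a simple expected-cost model (API fee plus error rate times the cost of a wrong answer), which the report does not state explicitly. The function name is hypothetical, and $0.04 is an assumed midpoint of the reported $0.03-0.05 range.

```python
def break_even_error_cost(
    cheap_cost: float,
    cheap_accuracy: float,
    strong_cost: float,
    strong_accuracy: float,
) -> float:
    """Cost of one wrong answer at which both models have equal expected cost.

    Assumed model: expected_cost = api_cost + (1 - accuracy) * error_cost.
    Setting the two expected costs equal and solving for error_cost gives
    (strong_cost - cheap_cost) / (strong_accuracy - cheap_accuracy).
    """
    accuracy_gain = strong_accuracy - cheap_accuracy
    if accuracy_gain <= 0:
        raise ValueError("the stronger model must be more accurate")
    return (strong_cost - cheap_cost) / accuracy_gain


# Figures from the report; 0.04 is an assumed midpoint of $0.03-0.05.
extra_hard = break_even_error_cost(0.001, 0.40, 0.04, 0.80)
overall = break_even_error_cost(0.001, 0.85, 0.04, 0.92)
```

Under these assumptions, o3 pays off on the 'extra hard' subset once a wrong answer costs more than roughly $0.10, while across Spider overall the bar is roughly $0.56 per wrong answer, which is why routing by difficulty matters more than a blanket upgrade.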
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:30:17.481389+00:00— report_created — created