Report #56252
[cost_intel] NL2SQL schema complexity threshold for reasoning models
For Text-to-SQL, reserve reasoning models (o1/o3) for database schemas that exceed 100 tables and have complex implicit join paths (BIRD dev hard subset). For schemas under 50 tables with straightforward foreign keys, use GPT-4o or Claude-3.5-Sonnet with schema-linking RAG.
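A minimal sketch of the routing rule above, assuming the caller already has a table count and a rough join-hop estimate for the request. `QueryProfile`, `choose_model_tier`, and the constant names are hypothetical. The report does not say what to do for schemas between 50 and 100 tables, so falling back on the join-hop estimate there is an assumption of this sketch, not part of the recommendation.

```python
from dataclasses import dataclass

# Thresholds taken from the recommendation; the constant names are hypothetical.
LARGE_SCHEMA_TABLES = 100
SMALL_SCHEMA_TABLES = 50
JOIN_HOP_THRESHOLD = 3  # assumed reading of the "3-hop join complexity" threshold


@dataclass
class QueryProfile:
    table_count: int          # tables in the target database schema
    has_implicit_joins: bool  # join paths not declared as foreign keys
    estimated_join_hops: int  # assumed to come from a schema-linking step


def choose_model_tier(profile: QueryProfile) -> str:
    """Return "reasoning" or "standard" for a Text-to-SQL request."""
    if profile.table_count > LARGE_SCHEMA_TABLES and profile.has_implicit_joins:
        return "reasoning"
    if profile.table_count < SMALL_SCHEMA_TABLES and not profile.has_implicit_joins:
        return "standard"
    # The report gives no rule for the remaining cases; using the join-hop
    # estimate as a tie-breaker is an assumption of this sketch.
    if profile.estimated_join_hops >= JOIN_HOP_THRESHOLD:
        return "reasoning"
    return "standard"


if __name__ == "__main__":
    print(choose_model_tier(QueryProfile(150, True, 4)))
    print(choose_model_tier(QueryProfile(20, False, 1)))
```

Here "standard" stands for GPT-4o or Claude-3.5-Sonnet with schema-linking RAG, and "reasoning" for o1/o3.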
Journey Context:
The BIRD benchmark (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) shows cliffs in execution accuracy. On BIRD-dev easy (single table), GPT-4o scores 78% against o1's 82%, a 4-point gain that does not justify 20x the cost. On BIRD-dev hard (>3 table joins, nested queries, implicit schema relationships), GPT-4o drops to 24% while o1 holds at 58%. The degradation signature for cheaper models is syntactically valid SQL that executes but returns logically wrong results (wrong aggregation level, missing WHERE clauses). Once the cost of acting on those silently wrong results is counted, the cost-per-correct-query comparison inverts at the '3-hop join complexity' threshold.
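The cost comparison can be checked with simple arithmetic. Using only the figures quoted above (relative price 1 vs 20, and the stated accuracies), price divided by accuracy still favors the cheaper model on both subsets. The sketch below therefore assumes the inversion comes from the downstream cost of a wrong result. That error-cost model and all function names are assumptions of this example, not something the report specifies.

```python
def cost_per_correct(price: float, accuracy: float) -> float:
    """Expected model spend per correct query, ignoring the cost of errors."""
    return price / accuracy


def expected_cost(price: float, accuracy: float, error_cost: float) -> float:
    """Expected total cost per query when each wrong result costs error_cost."""
    return price + (1.0 - accuracy) * error_cost


def break_even_error_cost(cheap_price: float, cheap_acc: float,
                          strong_price: float, strong_acc: float) -> float:
    """Error cost above which the stronger model has the lower expected cost.

    Solves cheap_price + (1 - cheap_acc) * E == strong_price + (1 - strong_acc) * E.
    """
    return (strong_price - cheap_price) / (strong_acc - cheap_acc)


# Relative prices: standard model = 1, reasoning model = 20 (the "20x" above).
CHEAP, STRONG = 1.0, 20.0

easy = break_even_error_cost(CHEAP, 0.78, STRONG, 0.82)
hard = break_even_error_cost(CHEAP, 0.24, STRONG, 0.58)

if __name__ == "__main__":
    print(f"easy: spend per correct query {cost_per_correct(CHEAP, 0.78):.2f} vs "
          f"{cost_per_correct(STRONG, 0.82):.2f}")
    print(f"hard: spend per correct query {cost_per_correct(CHEAP, 0.24):.2f} vs "
          f"{cost_per_correct(STRONG, 0.58):.2f}")
    print(f"easy break-even error cost: {easy:.0f}x a standard call")
    print(f"hard break-even error cost: {hard:.0f}x a standard call")
```

Under these assumptions the reasoning model pays off on the easy subset only when a wrong result costs about 475 times a standard call, but on the hard subset the break-even drops to about 56 times. The threshold therefore depends on how expensive a silently wrong answer is for the workload.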
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:54:40.285140+00:00— report_created — created