Report #36489
[cost\_intel] Blindly applying test-time compute scaling without measuring the pass@k curve
For any code generation or math task, first benchmark the inference scaling curve: measure pass@1 and pass@100 on a cheap model and compare the gain against the cost of a reasoning model. Use a reasoning model only if the task scales steeply with more samples \(e.g., competition math\), not if it scales flat \(e.g., SQL generation\).
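A minimal sketch of that benchmark, assuming you have already sampled n completions per problem from a cheap model and counted how many passed. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k). The function names and the per-problem tallies are hypothetical and for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_curve(results, ks):
    """results: list of (n, c) per problem. Returns {k: mean pass@k}."""
    return {
        k: sum(pass_at_k(n, c, k) for n, c in results) / len(results)
        for k in ks
    }


# Hypothetical tallies (n samples, c correct) for two task types.
steep_task = [(100, 5), (100, 0), (100, 20), (100, 1)]
flat_task = [(100, 97), (100, 100), (100, 0), (100, 95)]

for name, results in [("steep", steep_task), ("flat", flat_task)]:
    curve = pass_at_k_curve(results, ks=[1, 10, 100])
    print(name, {k: round(v, 3) for k, v in curve.items()})
```

In these made-up tallies the steep task has a large gap between pass@1 and pass@100, while the flat task barely moves. That gap is the signal the benchmark is meant to surface.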
Journey Context:
Reasoning models effectively perform test-time search and verification internally. For some tasks \(AIME math, competitive programming\), additional compute yields steep accuracy gains. For others \(routine SQL, JSON formatting\), accuracy plateaus at pass@1. The cost trap: using o1 for a task where GPT-4o already achieves 95% pass@1 means paying 30x for the last 5%, which may not matter. Quality signature: benchmark a cheap model first, sampling at temperature=0.8. If diversity helps \(pass@100 >> pass@1\), a reasoning model is likely to help. If not, it is likely to waste money.
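The cost trap can be made concrete with a little arithmetic. The sketch below uses the figures from this report (95% pass@1 on the cheap model, a 30x price multiple) and assumes, as a best case that is not a measured result, that the reasoning model solves every task. The function names and the unit cost of 1.0 are hypothetical.

```python
def cost_per_solved(cost_per_call: float, accuracy: float) -> float:
    """Average spend per successfully solved task."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_call / accuracy


def marginal_cost_per_extra_solve(
    cheap_cost: float, cheap_acc: float, strong_cost: float, strong_acc: float
) -> float:
    """Extra spend per additional task solved by switching models."""
    gain = strong_acc - cheap_acc
    if gain <= 0:
        return float("inf")  # no accuracy gain: the extra spend buys nothing
    return (strong_cost - cheap_cost) / gain


# Figures from the report; reasoning-model accuracy of 1.0 is an assumption.
cheap_cost, cheap_acc = 1.0, 0.95
strong_cost, strong_acc = 30.0, 1.00

print(cost_per_solved(cheap_cost, cheap_acc))
print(cost_per_solved(strong_cost, strong_acc))
print(marginal_cost_per_extra_solve(cheap_cost, cheap_acc, strong_cost, strong_acc))
```

Under these assumptions each additional solved task costs roughly 580 cheap-model calls, so the switch only pays off when a failure on that last 5% is worth more than that.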
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:43:25.043438+00:00— report_created — created