Report #90663
[cost_intel] Using small models for novel coding tasks requiring reasoning beyond training data
Reserve Claude 3.5 Sonnet or GPT-4o for coding tasks that involve novel algorithms or libraries released after the training cutoff. On LiveBench's contamination-resistant algorithmic coding tasks, Sonnet scores 65% versus 35% for Haiku 3.5, so the smaller models are not a substitute for frontier models in research-grade code generation.
Journey Context:
Smaller models lean more heavily on memorized patterns. On LiveBench (a contamination-resistant eval built from recent problems), GPT-4o scores 80% on coding while GPT-4o-mini scores 45%. The cost of a failure (a wrong algorithm plus the debugging time it causes) exceeds the per-call price difference of $0.01 versus $0.60. Use Haiku/Mini for boilerplate generation, and Sonnet/4o for logic that requires reasoning about novel constraints or recent library versions.
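
The trade-off above can be made concrete with a rough expected-cost calculation. The sketch below is illustrative only. It assumes the $0.01 and $0.60 per-call figures belong to GPT-4o-mini and GPT-4o respectively, and it treats the quoted LiveBench scores as a stand-in for the chance of getting a correct answer. The names `ModelTier`, `expected_cost`, `break_even_failure_cost`, and `pick_model` are hypothetical, and so is any dollar value you supply for the cost of a failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_call: float  # USD per call, figure quoted in the report
    success_rate: float   # LiveBench coding score, used as a rough proxy (assumption)


# Assumption: the report's $0.01 / $0.60 map to the mini / full model.
SMALL = ModelTier("gpt-4o-mini", cost_per_call=0.01, success_rate=0.45)
LARGE = ModelTier("gpt-4o", cost_per_call=0.60, success_rate=0.80)


def expected_cost(tier: ModelTier, failure_cost: float) -> float:
    """Call price plus the expected cost of a wrong answer."""
    return tier.cost_per_call + (1.0 - tier.success_rate) * failure_cost


def break_even_failure_cost(small: ModelTier, large: ModelTier) -> float:
    """Failure cost above which the larger model is cheaper in expectation."""
    price_gap = large.cost_per_call - small.cost_per_call
    accuracy_gap = large.success_rate - small.success_rate
    return price_gap / accuracy_gap


def pick_model(task_is_novel: bool, failure_cost: float) -> ModelTier:
    """Boilerplate goes to the small tier; novel tasks go to the cheaper tier in expectation."""
    if not task_is_novel:
        return SMALL
    return min((SMALL, LARGE), key=lambda tier: expected_cost(tier, failure_cost))


if __name__ == "__main__":
    print(f"break-even failure cost: ${break_even_failure_cost(SMALL, LARGE):.2f}")
    # failure_cost=25.0 is a hypothetical value for lost debugging time.
    print(pick_model(task_is_novel=True, failure_cost=25.0).name)
```

Under these assumptions the break-even point is the price gap divided by the accuracy gap, so the larger model comes out ahead once a failed generation costs more than a couple of dollars of someone's time. Benchmark scores are only a loose proxy for success on your own tasks, so treat the output as a sanity check, not a measurement.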
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:46:22.095838+00:00— report_created — created