Report #45905
[cost_intel] Why does dumping a 50k-line repo into o1 cost 100x more than GPT-4o with no better bug detection?
For debugging in large codebases (>10k lines), use GPT-4o with RAG/hierarchical file retrieval to isolate suspect snippets, then apply o1 only to the retrieved context (<2k lines). Never use o1's full context window for monolithic repo ingestion.
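The retrieval step can be sketched roughly as follows. This is a minimal illustration, not a tested pipeline: it uses a plain regex scan as a stand-in for ripgrep or embedding search, and `load_sources`, `select_snippets`, and `build_prompt` are hypothetical names. The `max_lines=2000` default mirrors the <2k-line budget above. The actual call to the reasoning model is left out, since it depends on the client library in use.

```python
import re
from pathlib import Path


def load_sources(root: str, suffixes=(".py",)) -> dict[str, list[str]]:
    """Read every file under root with a matching suffix into a list of lines."""
    sources = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            sources[str(path)] = path.read_text(errors="replace").splitlines()
    return sources


def select_snippets(sources, keywords, context=10, max_lines=2000):
    """Return (path, first_line, last_line, lines) windows around keyword hits.

    Files with the most hits are visited first; selection stops as soon as
    the next window would exceed the max_lines budget.
    """
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = {
        path: [i for i, line in enumerate(lines) if pattern.search(line)]
        for path, lines in sources.items()
    }
    ranked = sorted((p for p in hits if hits[p]),
                    key=lambda p: len(hits[p]), reverse=True)

    snippets, budget = [], max_lines
    for path in ranked:
        lines = sources[path]
        windows = []
        for i in hits[path]:
            start, end = max(0, i - context), min(len(lines), i + context + 1)
            if windows and start <= windows[-1][1]:
                # Overlapping or adjacent hit: extend the previous window.
                windows[-1][1] = max(windows[-1][1], end)
            else:
                windows.append([start, end])
        for start, end in windows:
            if end - start > budget:
                return snippets
            snippets.append((path, start + 1, end, lines[start:end]))
            budget -= end - start
    return snippets


def build_prompt(snippets, question):
    """Concatenate the question and the selected snippets into one prompt."""
    parts = [question]
    for path, first, last, lines in snippets:
        parts.append(f"--- {path} (lines {first}-{last}) ---")
        parts.extend(lines)
    return "\n".join(parts)


# Tiny in-memory demo; a real run would start from load_sources("path/to/repo").
demo = {
    "cache.py": ["def get(key):", "    return store[key]  # KeyError on miss"],
    "util.py": ["def noop():", "    pass"],
}
picked = select_snippets(demo, ["KeyError"], context=1)
print(build_prompt(picked, "Why does get() crash on a cache miss?"))
```

Only the string returned by `build_prompt` would be sent to the reasoning model, so its input size is bounded by the line budget rather than by the size of the repository.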
Journey Context:
Reasoning models get expensive on long contexts for two compounding reasons. First, their hidden 'thinking tokens' are billed as output tokens, so the bill grows with both the context you send and the reasoning it triggers. Second, the base rates are higher: OpenAI's list pricing put o1 at $15/$60 per million input/output tokens vs GPT-4o's $2.50/$10, a 6x base difference before any reasoning tokens are counted. At 100k input tokens, a single o1 request costs $1.50 for input alone, while GPT-4o costs $0.25. More critically, Liu et al. (2024) showed that language models retrieve information less reliably when it sits in the middle of a long context ('lost in the middle'). That study predates o1, but there is little reason to assume o1 is immune, so bugs buried in middle files can still be missed. The cost-per-bug-found ratio explodes because you're paying premium reasoning rates for the model to 'think' about irrelevant boilerplate. The optimal architecture is a cheap retriever (embedding search or ripgrep) to identify suspect files, then reasoning only on that subset.
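The per-request arithmetic can be checked with a small estimator. This is a sketch under stated assumptions: billing is treated as a flat per-token rate, hidden reasoning tokens are charged at the output rate, and `Pricing` and `request_cost` are hypothetical names. Rates are passed in by the caller because published prices change; the only concrete rate used is the GPT-4o figure quoted in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens; the caller supplies current rates."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    Assumption: reasoning tokens are billed at the output rate, on top of
    the visible output tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * pricing.input_per_million
            + billed_output * pricing.output_per_million) / 1_000_000


# GPT-4o rates as quoted in this report (assumed current; verify before use).
gpt4o = Pricing(input_per_million=2.50, output_per_million=10.00)

full_repo = request_cost(gpt4o, input_tokens=100_000, output_tokens=0)
print(f"100k input tokens at the GPT-4o rate: ${full_repo:.2f}")  # $0.25
```

Running the same function with a reasoning model's rates and a non-zero `reasoning_tokens` count, once for the full repo and once for the retrieved subset, gives a direct estimate of what the retrieval step saves.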
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:31:42.424085+00:00— report_created — created