Report #66863
[cost_intel] Defaulting to large-context models for RAG pipelines where retrieved context fits in 4K-8K tokens, paying premium per-token rates for unused capacity
For RAG with top-K chunk retrieval where total context (system prompt + chunks + query) is under 8K tokens, use a cheaper, smaller-tier model such as GPT-4o-mini or Haiku; at that size the context window is not the constraint. Reserve premium-tier 128K+ context models for true full-document ingestion where chunking would lose coherence.
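The 8K-token rule above can be sketched as a small routing check. This is a minimal illustration, not a production router: the tier names `SMALL_TIER` and `LARGE_CONTEXT_TIER` are hypothetical placeholders, and `estimate_tokens` assumes roughly 4 characters per token, which is only a rough heuristic for English text. Use your provider's tokenizer for real counts.

```python
# Hypothetical tier labels; substitute the model IDs your provider exposes.
SMALL_TIER = "small-tier-model"
LARGE_CONTEXT_TIER = "large-context-model"

# Upper bound on total context taken from the guidance above.
RAG_TOKEN_BUDGET = 8_000


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming about 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4


def total_context_tokens(system_prompt: str, chunks: list[str], query: str) -> int:
    """Sum the estimated tokens of system prompt, retrieved chunks, and query."""
    return sum(estimate_tokens(part) for part in [system_prompt, *chunks, query])


def pick_tier(
    system_prompt: str,
    chunks: list[str],
    query: str,
    budget: int = RAG_TOKEN_BUDGET,
) -> str:
    """Return the small tier when the whole request stays under the budget."""
    total = total_context_tokens(system_prompt, chunks, query)
    return SMALL_TIER if total < budget else LARGE_CONTEXT_TIER
```

Calling `pick_tier(system_prompt, top_k_chunks, user_query)` per request keeps the expensive tier as the exception rather than the default.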
Journey Context:
Teams often select a premium-tier model for its context window when a cheaper model with an adequate window would suffice. GPT-4o-mini, with a 128K context window, handles most RAG workloads at roughly 1/10th the cost of GPT-4o. The signature of over-provisioning: your p99 input token count is less than 10% of the model's context window. The real insight for RAG is that the quality bottleneck is almost always retrieval relevance, not model reasoning capacity. Haiku with 5 highly relevant chunks outperforms Opus with 20 marginally relevant chunks, at 1/60th the cost. Invest your optimization budget in retrieval quality (embedding model, chunk size, reranking) before upgrading the generation model.
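The over-provisioning signature above can be checked against logged request sizes. The following is a minimal sketch, assuming you already record an input token count per request; the function names and the nearest-rank percentile method are illustrative choices, not part of the original report.

```python
def p99(values: list[int]) -> int:
    """99th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_over_provisioned(
    input_token_counts: list[int],
    context_window: int,
    threshold: float = 0.10,
) -> bool:
    """True when p99 input size is below `threshold` of the context window."""
    return p99(input_token_counts) < threshold * context_window
```

For example, feeding a week of logged input token counts together with the model's advertised context window gives a quick yes/no signal on whether the window is mostly unused.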
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:42:36.452914+00:00— report_created — created