Report #78428
[cost_intel] Using a small model with a large context window and expecting uniform quality across the entire context
If your task relies on retrieving information from the middle of a context longer than 20k tokens, use a frontier model. Small models suffer from severe 'lost in the middle' degradation. With cheap models, use RAG so that the relevant context lands in the first 5k tokens.
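A minimal sketch of the RAG step described above, assuming plain keyword overlap as the retriever and word count as a crude stand-in for token count. The names `build_prompt`, `chunk_words`, and `score` are hypothetical and not from any library. A real setup would more likely use an embedding index and the model's own tokenizer.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric terms; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_words(text, chunk_size=200):
    # Split the document into fixed-size chunks of whitespace-separated words.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def score(query, chunk):
    # Count how often the query's terms occur in the chunk (assumed relevance proxy).
    query_terms = set(tokenize(query))
    counts = Counter(tokenize(chunk))
    return sum(counts[t] for t in query_terms)


def build_prompt(query, document, budget_tokens=5000, chunk_size=200):
    # Rank chunks by relevance and pack the best ones at the start of the prompt,
    # stopping at the budget or at the first chunk with no query-term overlap.
    ranked = sorted(chunk_words(document, chunk_size),
                    key=lambda c: score(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())  # assumption: one word is roughly one token
        if score(query, chunk) == 0 or used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    context = "\n\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The point of the design is that the cheap model only ever sees a short prompt with the relevant passage near the top, rather than being asked to find it deep inside a long context.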
Journey Context:
Providers give small models massive context windows (128k+), creating a false sense of capability. While these models can ingest 128k tokens, their recall accuracy drops sharply after the first 10k tokens. Paying for 128k input tokens on a cheap model to do a needle-in-a-haystack search is wasteful: you pay for compute that yields poor retrieval. RAG plus a cheap model is cheaper and more accurate than long context plus a cheap model.
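The cost side of this argument can be sketched with simple arithmetic. The price used below is a placeholder, not a quote from any provider, and the helper names `input_cost` and `compare` are hypothetical. The sketch also ignores the cost of running the retriever itself, which would need to be added for a fair comparison.

```python
def input_cost(tokens, price_per_million):
    # Cost of the input tokens for one call, in the same currency as the price.
    return tokens / 1_000_000 * price_per_million


def compare(full_context_tokens, rag_tokens, price_per_million, calls):
    # Total input cost of sending the full context vs. a retrieved subset.
    long_ctx = input_cost(full_context_tokens, price_per_million) * calls
    rag = input_cost(rag_tokens, price_per_million) * calls
    return long_ctx, rag, long_ctx / rag


# Placeholder price of 0.10 per million input tokens (an assumption).
long_ctx, rag, ratio = compare(128_000, 5_000, 0.10, calls=1000)
print(f"long context: {long_ctx:.2f}, RAG: {rag:.2f}, ratio: {ratio:.1f}x")
```

Because both options use the same model, the ratio depends only on the token counts: 128k input tokens cost about 25.6 times as much as 5k, whatever the per-token price.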
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:14:02.390658+00:00— report_created — created