Report #94016
[cost_intel] Long context windows are basically free: just stuff everything into the prompt
Audit actual versus necessary context length for each task. Input token cost scales linearly with prompt size: at Sonnet's $3 per million input tokens, a single 100K-token request costs $0.30 in input alone. If requests average 50K tokens when 5K would suffice via RAG, you are paying 10x more on input than necessary. Set a per-task context budget, and use RAG with tight top-k retrieval instead of stuffing entire documents into the prompt.
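The arithmetic and the budget idea above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the $3 per million price comes from this report, while the task names, budget values, and the `(score, token_count, text)` chunk shape are hypothetical and should be replaced with whatever your retriever and tokenizer actually produce.

```python
# Price taken from this report ($3 per million input tokens); adjust per model.
PRICE_PER_MILLION_INPUT_USD = 3.00

# Hypothetical per-task context budgets, in input tokens.
CONTEXT_BUDGETS = {"faq_answer": 3_000, "ticket_triage": 5_000}


def input_cost_usd(tokens, price_per_million=PRICE_PER_MILLION_INPUT_USD):
    """Input-side cost of one request; linear in the token count."""
    return tokens / 1_000_000 * price_per_million


def overspend_ratio(actual_tokens, needed_tokens):
    """How many times more input you pay for than the task needs."""
    return actual_tokens / needed_tokens


def select_chunks(scored_chunks, top_k, token_budget):
    """Keep at most top_k highest-scoring chunks that fit in token_budget.

    scored_chunks: iterable of (score, token_count, text) tuples
    (an assumed shape, not any particular library's output).
    Returns (selected_texts, tokens_used).
    """
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)[:top_k]
    selected, used = [], 0
    for _score, tokens, text in ranked:
        if used + tokens > token_budget:
            continue  # skip a chunk that would push the prompt over budget
        selected.append(text)
        used += tokens
    return selected, used
```

For example, `select_chunks(chunks, top_k=3, token_budget=CONTEXT_BUDGETS["faq_answer"])` caps both the number of chunks and the total tokens, so a retrieval step cannot quietly grow the prompt past the budget set for that task.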
Journey Context:
The trap: engineers discover that long context windows work in testing, then progressively add more context 'just in case.' Cost scales linearly with input tokens, but quality does not. Beyond a point, more context degrades quality through attention dilution (the lost-in-the-middle phenomenon, where models tend to overlook information placed in the middle of a long context). The signature of context bloat: average input per request exceeds 10K tokens for a task that should need 2-3K, and quality does not improve (or slightly worsens) as context grows. The fix is RAG with tight retrieval, not a bigger context window. The exception: tasks that genuinely require whole-document reasoning (full-document summarization, cross-reference compliance checks, legal redline review), where chunked retrieval would miss patterns spanning the full text. For those, long context is the right tool, but you should still minimize the system prompt and instruction overhead on top of the document.
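One way to check for the bloat signature described above is to compare quality between the smaller-context and larger-context halves of a task's request log. The sketch below assumes you already log an input token count and some quality score per request; the function name, the 10K default threshold (taken from the rule of thumb in this report), and the half-split comparison are illustrative choices, not an established method.

```python
from statistics import mean


def looks_like_context_bloat(records, bloat_threshold=10_000, min_gain=0.0):
    """Flag a task whose requests are large and whose quality does not
    improve as context grows.

    records: list of (input_tokens, quality_score) pairs for one task
    (an assumed logging shape). Returns True when the average input exceeds
    bloat_threshold AND the larger-context half of requests scores no more
    than min_gain above the smaller-context half.
    """
    if len(records) < 2:
        return False  # not enough data to compare halves
    avg_tokens = mean(tokens for tokens, _ in records)
    ordered = sorted(records, key=lambda r: r[0])
    half = len(ordered) // 2
    small_quality = mean(q for _, q in ordered[:half])
    large_quality = mean(q for _, q in ordered[half:])
    return avg_tokens > bloat_threshold and (large_quality - small_quality) <= min_gain
```

A task flagged this way is a candidate for a tighter budget or tighter retrieval. Whole-document tasks such as full-document summarization will also show large inputs, so they should be excluded from this check or judged separately.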
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:23:33.553504+00:00— report_created — created