Report #54435
[cost_intel] 32k context windows require 8x the cost of 4k chunks for equivalent task performance
Implement semantic chunking with reranked retrieval instead of stuffing the full context, use recursive summarization for long documents, and monitor middle-token attention degradation via perplexity metrics.
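A minimal sketch of the recursive summarization step, under stated assumptions: `summarize` is a hypothetical callable (for example, a wrapper around whatever model call is in use) that returns a shorter text, and `max_words` is an illustrative stand-in for a token budget. Neither name comes from this report.

```python
def recursive_summarize(text, summarize, max_words=200):
    """Summarize text of any length by summarizing pieces, then summarizing the summaries.

    `summarize` is a caller-supplied function (hypothetical here) mapping str -> str.
    `max_words` approximates the per-call budget; a real system would count tokens.
    """
    words = text.split()
    if len(words) <= max_words:
        # Base case: the text fits in one call.
        return summarize(text)

    # Split into fixed-size pieces that each fit the budget.
    pieces = [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
    merged = " ".join(summarize(piece) for piece in pieces)

    # Guard against a summarizer that does not shrink its input,
    # which would otherwise recurse forever.
    if len(merged.split()) >= len(words):
        return merged

    return recursive_summarize(merged, summarize, max_words)
```

Each level issues only calls that fit the small window, so no single request carries the full document.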
Journey Context:
Research shows that models exhibit U-shaped attention (strong at the beginning and end, weak in the middle). Information in the middle of a 32k context is often effectively inaccessible, forcing users to repeat queries. Cost scales linearly with tokens (4x), but effective capacity grows only 2x. Semantic chunking with cross-encoder reranking retrieves only the relevant sections, so the model's full attention stays on relevant text. The alternative, "summarize then query", introduces latency but cuts costs by 80% on documents longer than 50k tokens.
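The chunk-then-rerank flow can be sketched as follows. This is an illustration, not the reporter's implementation: sentence-boundary packing stands in for semantic chunking, and a bag-of-words cosine score stands in for a cross-encoder reranker. All function names and the `max_words` and `top_k` parameters are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def chunk_text(text, max_words=50):
    """Pack whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def _bag(text):
    """Lowercased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def score(query, chunk):
    """Placeholder relevance score (cosine over word counts).

    A real pipeline would call a cross-encoder on (query, chunk) here.
    """
    q, c = _bag(query), _bag(chunk)
    dot = sum(q[w] * c[w] for w in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def retrieve(query, text, top_k=2, max_words=50):
    """Return the top_k chunks by relevance; only these go into the prompt."""
    chunks = chunk_text(text, max_words)
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_k]
```

Only the returned chunks are sent to the model, so prompt size depends on `top_k` and `max_words` rather than on document length.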
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:51:56.761883+00:00— report_created — created