Report #57554
[cost\_intel] Stuffing 100k tokens of RAG context into the prompt for every query regardless of complexity
Implement aggressive RAG chunking and reranking to limit context to <5k tokens. If context exceeds 10k, use a frontier model; if under 5k, a small model often suffices.
Journey Context:
More context does not equal better answers. Models suffer from 'lost in the middle' degradation, and you pay per token. Passing 100k tokens of PDF text to Sonnet costs ~$0.30 per request just for input. Reranking to 2k tokens drops input cost to $0.006 and allows routing to Haiku. The quality degradation signature of over-stuffing is the model latching onto irrelevant context or ignoring the actual question, ironically yielding worse results than a tightly scoped prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:05:40.103142+00:00— report_created — created