Report #61554
[cost\_intel] Sending full long documents to frontier models for extraction when only small sections are relevant
For documents >10K tokens where you need specific extractions, use a two-stage pipeline: \(1\) chunk and embed the document, retrieve top-K relevant chunks via similarity search, \(2\) send only retrieved chunks to the frontier model. Typical cost reduction: 10-30x with minimal quality loss for targeted extraction tasks.
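A minimal sketch of the two stages, under loud assumptions: whitespace-style word tokens stand in for model tokens, and `toy_embed` is a hypothetical bag-of-words placeholder for a real embedding model, so that the example runs without any API key. The function names are illustrative and do not come from any particular library.

```python
import math
import re
from collections import Counter

def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

def toy_embed(tokens):
    """Placeholder embedding (assumption): lowercase word counts.
    Replace with calls to a real embedding model in practice."""
    return Counter(t.lower() for t in tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_top_k(document, query, k=10, size=512, overlap=50):
    """Stage 1: chunk, embed, and return the k most similar chunks in document order."""
    tokens = re.findall(r"\w+", document)
    chunks = chunk_tokens(tokens, size, overlap)
    q = toy_embed(re.findall(r"\w+", query))
    ranked = sorted(range(len(chunks)), key=lambda i: cosine(toy_embed(chunks[i]), q), reverse=True)
    keep = sorted(ranked[:k])  # restore original order so the model reads coherent context
    return [" ".join(chunks[i]) for i in keep]

# Stage 2 (not shown): join the returned chunks into the prompt and send only
# that text, rather than the whole document, to the frontier model.
```

Returning the chunks in document order rather than score order is a design choice here: it keeps neighbouring passages adjacent in the prompt.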
Journey Context:
Processing a 100K-token document through Sonnet costs $0.30 in input tokens per request \(at $3 per million input tokens\). If you are extracting 5 specific data points, you pay to process 100K tokens when only ~2-5K are relevant. The two-stage pipeline: chunk at 512 tokens with a 50-token overlap, embed each chunk \(~$0.002 total via text-embedding-3-small at $0.02 per million tokens, a one-time cost per document\), retrieve the top-10 chunks via similarity search, and send only those ~5K tokens to Sonnet \($0.015 input\). Total: ~$0.017 vs $0.30, roughly 17x savings on the first request and close to 20x on later requests that reuse the cached embeddings. The quality tradeoff: pure semantic retrieval can miss relevant sections when the wording of the query diverges from the wording of the chunk. Mitigate with hybrid search \(BM25 keyword \+ dense embedding\), a slightly generous top-K \(10 rather than 5\), and chunk overlap. This pattern is most valuable for high-volume document processing \(insurance claims, legal contracts, research papers\) where the extraction targets are known in advance.
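The arithmetic can be checked with a short calculation. This is a sketch under stated assumptions: the Sonnet rate of $3 per million input tokens is the one implied by $0.30 per 100K tokens, and the embedding rate of $0.02 per million tokens for text-embedding-3-small is an assumed list price that puts the embedding pass at about $0.002 for this document. Verify both against current pricing. Output-token costs are ignored, and `pipeline_cost` is a hypothetical helper name.

```python
import math

SONNET_INPUT_USD_PER_MTOK = 3.00  # implied by $0.30 per 100K input tokens
EMBED_USD_PER_MTOK = 0.02         # assumed price for text-embedding-3-small

def pipeline_cost(doc_tokens=100_000, chunk_size=512, overlap=50, top_k=10):
    """Compare full-document input cost with the chunk, embed, retrieve pipeline."""
    step = chunk_size - overlap
    n_chunks = max(1, math.ceil((doc_tokens - overlap) / step))
    # Overlapping windows re-embed `overlap` tokens at every chunk boundary.
    embedded_tokens = doc_tokens + overlap * (n_chunks - 1)
    # Upper bound: assumes every retrieved chunk is a full-size chunk.
    retrieved_tokens = min(top_k, n_chunks) * chunk_size

    full_usd = doc_tokens * SONNET_INPUT_USD_PER_MTOK / 1e6
    embed_usd = embedded_tokens * EMBED_USD_PER_MTOK / 1e6
    retrieved_usd = retrieved_tokens * SONNET_INPUT_USD_PER_MTOK / 1e6
    return {
        "n_chunks": n_chunks,
        "full_usd": full_usd,
        "embed_usd": embed_usd,
        "retrieved_usd": retrieved_usd,
        "savings_first_request": full_usd / (embed_usd + retrieved_usd),
        "savings_cached": full_usd / retrieved_usd,  # embeddings already stored
    }

if __name__ == "__main__":
    for name, value in pipeline_cost().items():
        print(f"{name}: {value:.4f}")
```

Separating the first-request ratio from the cached ratio makes it visible that the embedding pass is paid once per document, while the retrieval savings recur on every request.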
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:48:39.119929+00:00— report_created — created