Report #59218
[cost\_intel] Sending entire long documents \(50K\+ tokens\) to frontier models for extraction or Q&A when chunked processing with small models suffices
Use a two-stage architecture: chunk documents into 2K-4K token sections, process each with a cheap model \(Haiku/Flash\) for extraction and relevance scoring, then send only the relevant chunks to a frontier model for synthesis. Estimated input-cost reduction: roughly 5-10x for documents over 50K tokens, depending on how many chunks reach the frontier model.
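The two stages described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `score_chunk` and `synthesize` are hypothetical callables standing in for the cheap-model and frontier-model calls, tokens are assumed to be pre-split strings (a real pipeline would use the model's tokenizer), and the 200-token overlap is an arbitrary illustrative default.

```python
from typing import Callable, List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 4000, overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of `size` tokens, each overlapping the previous by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def select_relevant(
    chunks: List[List[str]],
    question: str,
    score_chunk: Callable[[str, str], float],  # hypothetical cheap-model call
    top_k: int = 5,
) -> List[List[str]]:
    """Stage 1: score every chunk, keep the top_k, and return them in document order."""
    scored = [(score_chunk(question, " ".join(c)), i) for i, c in enumerate(chunks)]
    best = sorted(scored, reverse=True)[:top_k]
    return [chunks[i] for i in sorted(i for _, i in best)]


def answer(
    tokens: Sequence[str],
    question: str,
    score_chunk: Callable[[str, str], float],
    synthesize: Callable[[str, str], str],  # hypothetical frontier-model call
    top_k: int = 5,
) -> str:
    """Stage 2: send only the selected chunks to the frontier model."""
    relevant = select_relevant(chunk_tokens(tokens), question, score_chunk, top_k)
    context = "\n\n".join(" ".join(c) for c in relevant)
    return synthesize(question, context)
```

Returning the selected chunks in document order, rather than score order, is a design choice that keeps the synthesis prompt closer to the original reading order.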
Journey Context:
Processing a 100K-token document through Sonnet costs $0.30 in input tokens alone \(at $3 per million input tokens\). At 1,000 documents/day, that is $300/day. But most long-document tasks need information from only 5-15% of the text. A chunk-and-route architecture avoids paying frontier prices for the rest: split the document into 4K-token chunks, run each through Haiku with a relevance-scoring prompt \($0.001/chunk, so $0.025/document for 25 chunks\), then send the top 3-5 chunks to Sonnet \($0.036 for 3 chunks, i.e. 12K tokens\). Total: $0.061/document vs $0.30, roughly a 5x saving. Sending 5 chunks \(20K tokens, $0.060\) raises the total to $0.085, about 3.5x. These figures count input tokens only.

The quality tradeoff: chunking loses cross-section context. If the task requires synthesizing information spread across the entire document \(e.g., 'what are the recurring themes?'\), you need the full context. But for targeted extraction \('find all mentions of revenue guidance'\), chunking with overlap is often equivalent or better, because each chunk gets more focused attention from the model.
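The cost arithmetic above can be reproduced with a short calculation. The per-million-token rates below are assumptions inferred from the figures in this report \($0.30 per 100K Sonnet tokens, $0.001 per 4K Haiku chunk\), not quoted list prices, and the sketch ignores output tokens, prompt overhead, and chunk overlap.

```python
SONNET_USD_PER_MTOK = 3.00  # assumption: implied by "$0.30 for 100K input tokens"
HAIKU_USD_PER_MTOK = 0.25   # assumption: implied by "$0.001 per 4K-token chunk"


def input_cost(tokens: int, usd_per_mtok: float) -> float:
    """Input-token cost in USD at a given per-million-token rate."""
    return tokens * usd_per_mtok / 1_000_000


def two_stage_cost(doc_tokens: int, chunk_size: int = 4000, top_k: int = 3) -> float:
    """Cheap model scores every chunk; frontier model reads only the top_k chunks."""
    n_chunks = -(-doc_tokens // chunk_size)  # ceiling division
    stage1 = n_chunks * input_cost(chunk_size, HAIKU_USD_PER_MTOK)
    stage2 = input_cost(min(top_k, n_chunks) * chunk_size, SONNET_USD_PER_MTOK)
    return stage1 + stage2


baseline = input_cost(100_000, SONNET_USD_PER_MTOK)
for k in (3, 5):
    cost = two_stage_cost(100_000, top_k=k)
    print(f"top_k={k}: ${cost:.3f}/doc vs ${baseline:.2f}/doc ({baseline / cost:.1f}x)")
```

Varying `top_k` and `chunk_size` shows how sensitive the saving is to the number of tokens that reach the frontier model.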
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:53:22.803478+00:00— report_created — created