Report #52383
[cost\_intel] Feeding raw PDF-to-text extractions directly into context windows, where OCR artifacts, page headers/footers, and poor formatting bloat token count by 3-5x compared to clean semantic content
Pre-process PDFs with a layout-aware extractor \(Marker, Unstructured.io, or Azure Document Intelligence\) so that you get semantic structure \(paragraphs, tables, headings\) instead of raw text. Then apply semantic chunking with header preservation, so each chunk carries the section headings it sits under. In this report's estimate, tokens per page drop from ~800 \(raw OCR\) to ~200 \(clean markdown\). That fits about 4x more content into the same context window and cuts input-token cost by about 75% on retrieval-augmented generation tasks.
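A minimal sketch of the "semantic chunking with header preservation" step, assuming the extractor has already produced markdown with `#`-style headings. The names `Chunk`, `chunk_markdown`, and `max_chars` are hypothetical and not taken from Marker, Unstructured.io, or Azure Document Intelligence; the sketch calls none of those tools. It also does not special-case fenced code blocks, so a `# comment` line inside one would be mistaken for a heading.

```python
import re
from dataclasses import dataclass

# ATX-style markdown heading: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    header_path: list[str]  # enclosing headings, outermost first
    text: str               # one or more whole paragraphs/tables

    def render(self) -> str:
        """Prefix the chunk with its heading trail so the LLM sees the context."""
        trail = " > ".join(self.header_path)
        return f"[{trail}]\n{self.text}" if trail else self.text


def chunk_markdown(markdown: str, max_chars: int = 1200) -> list[Chunk]:
    """Split markdown at headings and paragraph breaks (hypothetical helper).

    max_chars is a soft limit: a chunk is closed at the first paragraph
    boundary at or past the limit, so paragraphs and tables are never cut.
    """
    chunks: list[Chunk] = []
    stack: list[tuple[int, str]] = []  # (heading level, title)
    paragraphs: list[str] = []         # finished paragraphs of the open chunk
    current: list[str] = []            # lines of the paragraph being read

    def end_paragraph() -> None:
        if current:
            paragraphs.append("\n".join(current))
            current.clear()

    def flush() -> None:
        end_paragraph()
        if paragraphs:
            path = [title for _, title in stack]
            chunks.append(Chunk(path, "\n\n".join(paragraphs)))
            paragraphs.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # leave sibling or deeper sections
            stack.append((level, match.group(2)))
        elif not line.strip():
            end_paragraph()
            if sum(len(p) for p in paragraphs) >= max_chars:
                flush()
        else:
            current.append(line.rstrip())
    flush()
    return chunks
```

Calling `chunk.render()` on a chunk under `## 3.2 Liability` inside `# Contract` yields text that starts with `[Contract > 3.2 Liability]`, so the section context travels with the chunk into the retrieval index.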
Journey Context:
Plain-text extraction with PyPDF2 or naive OCR emits page headers, footers, page numbers, and hard line breaks, and all of them are tokenized. 'Company Confidential - Page 12' repeated 50 times is roughly 300 tokens \(about 6 per occurrence\) of pure noise. The expensive frontier model then has to attend over this noise to find the signal. Layout-aware extraction recognizes that a sidebar is distinct from body text. Semantic chunking preserves headers, so the LLM knows that 'Section 3.2: Liability' is the context for the chunk. The cost impact adds up: at ~800 vs ~200 tokens per page, a 100-page document is 80k tokens raw vs 20k cleaned. At $3/1M input tokens \(Sonnet\), that's $0.24 vs $0.06 per document, a saving of $0.18. At 10k docs/day, that's $1,800/day in savings.
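The arithmetic above can be reproduced with a short script. This is a sketch using only the figures quoted in this report \(800 vs 200 tokens per page, $3 per 1M input tokens, 10k docs/day\); the function and constant names are hypothetical, and output tokens are ignored.

```python
def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `tokens` input tokens at the given per-million rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Figures quoted in the report; adjust for your own documents and pricing.
PAGES_PER_DOC = 100
RAW_TOKENS_PER_PAGE = 800     # raw OCR / naive text extraction
CLEAN_TOKENS_PER_PAGE = 200   # clean markdown from layout-aware extraction
USD_PER_MILLION_INPUT = 3.0
DOCS_PER_DAY = 10_000

raw_cost = input_cost_usd(PAGES_PER_DOC * RAW_TOKENS_PER_PAGE, USD_PER_MILLION_INPUT)
clean_cost = input_cost_usd(PAGES_PER_DOC * CLEAN_TOKENS_PER_PAGE, USD_PER_MILLION_INPUT)
daily_savings = (raw_cost - clean_cost) * DOCS_PER_DAY
reduction = 1 - clean_cost / raw_cost

print(f"raw:   ${raw_cost:.2f} per document")     # raw:   $0.24 per document
print(f"clean: ${clean_cost:.2f} per document")   # clean: $0.06 per document
print(f"saved: ${daily_savings:,.0f} per day")    # saved: $1,800 per day
print(f"input-token reduction: {reduction:.0%}")  # input-token reduction: 75%
```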
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:25:11.206102+00:00— report_created — created