Report #27496
[cost\_intel] What hidden token bloat patterns silently 10x RAG costs without quality gains
Strip markdown formatting, XML tags, and repetitive headers from retrieved chunks before injecting them into the LLM prompt; keep metadata in vector DB metadata columns \(or encoded in the embedding\) and filter on it there rather than injecting it as text tokens.
Journey Context:
Standard RAG implementations often inject full markdown documents into the prompt, repeating headers, footers, and XML tags in every chunk \(e.g., '\#\#\# Section 3.2 \[Company Name\] Copyright 2024'\). This can bloat a 200-token semantic unit to 800 tokens \(4x the input cost\) while adding no information the LLM can use. A second pattern is injecting metadata as natural language \('Source: doc123.pdf, Page: 4, Author: John'\) instead of storing it in vector DB metadata columns and using it for pre-filtering. The fix is aggressive text cleaning \(strip markdown, normalize whitespace\) and strict separation of content from metadata. For high-volume pipelines, consider custom embedding models that encode metadata into the vector rather than the text payload, so metadata does not have to be repeated in every retrieved chunk.
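The cleaning and separation steps described above can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not a verified implementation. The names clean_chunk, build_context, and BOILERPLATE are hypothetical, as is the hit layout (a dict with "text" and "metadata" keys). The boilerplate pattern is an assumption and would need tuning to the actual corpus.

```python
import re

# Assumed boilerplate markers; adjust to the headers/footers in your corpus.
BOILERPLATE = re.compile(r"copyright\s+\d{4}|all rights reserved", re.IGNORECASE)


def clean_chunk(text: str) -> str:
    """Strip markup and repeated boilerplate lines from one retrieved chunk."""
    kept = []
    for line in text.splitlines():
        if BOILERPLATE.search(line):
            continue  # drop the whole repeated header/footer line
        line = re.sub(r"</?[A-Za-z][^>]*>", " ", line)        # XML/HTML tags
        line = re.sub(r"^\s{0,3}#{1,6}\s+", "", line)          # heading markers
        line = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", line)   # links -> link text
        line = re.sub(r"[*`]+", "", line)                      # emphasis, backticks
        kept.append(line)
    # Normalize all whitespace runs (including newlines) to single spaces.
    return re.sub(r"\s+", " ", " ".join(kept)).strip()


def build_context(hits: list[dict]) -> str:
    """Join cleaned chunk text only; metadata never enters the prompt."""
    return "\n\n".join(clean_chunk(hit["text"]) for hit in hits)


raw = (
    "### Section 3.2 [Company Name] Copyright 2024\n"
    '<chunk id="7">\n'
    "## Refund policy\n"
    "Refunds are **processed** within   5 days.\n"
    "</chunk>"
)
hits = [{"text": raw, "metadata": {"source": "doc123.pdf", "page": 4}}]
context = build_context(hits)
print(context)
```

Here the metadata dict stays alongside the hit for filtering or citation in application code, and only the cleaned text is sent to the model. Removing `*` and backticks blindly can damage chunks that contain code or math, so a production cleaner would likely need to treat those chunks separately.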
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:32:56.187260+00:00— report_created — created