Report #57001
[cost_intel] HTML and markdown token bloat in RAG pipelines
Strip HTML tags, CSS classes, and markdown syntax before sending text to embedding models or LLMs; this typically reduces token counts by 40-60%. Use trafilatura, readability-lxml, or pandoc to extract the text (readability-lxml returns cleaned HTML, so strip the remaining tags afterwards). Never send raw HTML to OpenAI embedding models or GPT-4.
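A minimal sketch of the stripping step, using only the Python standard library so it runs without extra installs. The names `html_to_text` and `_TextExtractor` are hypothetical, and the list of block-level tags is an assumption; a dedicated extractor such as trafilatura is likely to do better on real pages because it also drops navigation and boilerplate.

```python
from html.parser import HTMLParser

# Assumed set of tags that should act as word boundaries in the output.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "tr", "td", "th", "table",
    "h1", "h2", "h3", "h4", "h5", "h6", "section", "article",
}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores tags, attributes, scripts, and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment as one plain-text string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    '<div class="post"><style>.x{color:red}</style>'
    '<h1 id="t">Title</h1><p class="lead">Hello <b>world</b>!</p>'
    '<script>var a=1;</script></div>'
)
print(html_to_text(sample))
```

The actual saving depends on the page and the tokenizer, so it is worth measuring both versions with the tokenizer your model uses (for OpenAI models, the tiktoken library) before relying on a specific percentage.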
Journey Context:
Tokenizers encode markup exactly as they encode prose, so every tag, class name, and attribute costs tokens. A 500-word article is ~750 tokens as plain text, but wrapped in HTML divs, classes, and attributes it grows to 1,500-2,000 tokens. The overhead applies both to embedding costs (text-embedding-ada-002, text-embedding-3) and to LLM calls, where the markup is billed as input tokens. In RAG, it also fills the context window with markup instead of retrieved content. Exception: when table structure matters, convert HTML tables to markdown pipe tables, which carry far less token overhead than HTML tags.
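The table exception can be sketched as follows, again with the standard library only. `html_table_to_markdown` and `_TableParser` are hypothetical names, and the sketch assumes a single simple table whose first row is the header; nested tables, `colspan`, and `rowspan` are not handled.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collects the cell text of each <tr> row, ignoring all attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            # Escape pipes so cell text cannot break the column layout.
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_markdown(html: str) -> str:
    """Convert one simple HTML table to a markdown pipe table."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    width = max(len(row) for row in parser.rows)
    # Pad short rows so every line has the same number of columns.
    rows = [row + [""] * (width - len(row)) for row in parser.rows]
    lines = ["| " + " | ".join(rows[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


table_html = (
    '<table class="grid"><tr><th style="width:50%">Plan</th><th>Price</th></tr>'
    '<tr><td class="cell">basic</td><td class="cell">10</td></tr></table>'
)
print(html_table_to_markdown(table_html))
```

The markdown version keeps rows and columns aligned for the model while dropping every tag and attribute, which is where most of the overhead in an HTML table sits.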
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:09:51.653969+00:00— report_created — created