Report #26401
[cost\_intel] What token bloat patterns silently 10x costs in code processing pipelines?
Pre-process code files before embedding or LLM processing: strip comments, import statements, and docstrings, then use tree-sitter to extract only function signatures and bodies, kept as chunks of under 512 tokens. This avoids embedding noise that can roughly double vector DB costs.
Journey Context:
Developers often embed entire raw code files, so that as much as 80% of the tokens go to imports, license headers, and boilerplate rather than logic. Example: a standard Python file with 100 lines of actual logic can generate 3,000\+ tokens once docstrings and imports are included. When the file is chunked for embedding, overlapping windows duplicate these bloated sections, so the same boilerplate is paid for more than once. The fix is AST-aware preprocessing: use tree-sitter to extract function/method nodes, strip docstrings \(unless semantically critical\), and normalize whitespace. This typically reduces token count by 60-70%, and embedding costs drop roughly in proportion because embedding is billed per input token. Smaller chunks also improve retrieval precision: chunks of around 512 tokens appear to be a sweet spot for the embedding model \(text-embedding-3-small\). This is a retrieval-quality heuristic, not a hard limit, since the model accepts far longer inputs. Beyond that size, semantic dilution sets in: the embedding averages unrelated concepts \(imports mixed with logic\), which degrades retrieval accuracy and forces expensive re-ranking with larger models.
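The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the reporter's pipeline: it swaps in Python's built-in `ast` module for tree-sitter so that it runs with no extra dependencies, which means it only handles Python source. The helper names (`rough_token_count`, `strip_docstring`, `extract_function_chunks`), the `SAMPLE` input, and the regex-based token estimate are all assumptions made for illustration. A real pipeline would count tokens with the embedding model's own tokenizer.

```python
import ast
import re

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def rough_token_count(text: str) -> int:
    # Crude proxy (words plus punctuation marks); NOT a real tokenizer.
    return len(re.findall(r"\w+|[^\w\s]", text))


def strip_docstring(node) -> None:
    # Drop a leading string-literal statement (the docstring), if present.
    body = node.body
    if (body and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)):
        node.body = body[1:] or [ast.Pass()]  # keep the body syntactically valid


def extract_function_chunks(source: str, max_tokens: int = 512) -> list[str]:
    """Return one chunk per top-level function or method, without docstrings.

    Imports, license headers, and comments never reach the output because
    only function nodes are re-emitted from the syntax tree.
    """
    tree = ast.parse(source)
    candidates = []
    for node in tree.body:
        if isinstance(node, FUNC_TYPES):
            candidates.append(node)
        elif isinstance(node, ast.ClassDef):
            candidates.extend(n for n in node.body if isinstance(n, FUNC_TYPES))

    chunks = []
    for func in candidates:
        # Also strip docstrings from anything nested inside the function.
        for sub in ast.walk(func):
            if isinstance(sub, FUNC_TYPES + (ast.ClassDef,)):
                strip_docstring(sub)
        text = ast.unparse(func)  # normalizes whitespace, drops comments
        # Assumption: oversized functions are skipped in this sketch;
        # a real pipeline would split them instead of discarding them.
        if rough_token_count(text) <= max_tokens:
            chunks.append(text)
    return chunks


SAMPLE = '''\
# Copyright (c) Example Corp. All rights reserved.
import os
import sys
from typing import List


def add(a, b):
    """Add two numbers.

    A long docstring that mostly restates the signature.
    """
    # inline comment
    return a + b


class Greeter:
    """Greets people."""

    def greet(self, name):
        """Return a greeting."""
        return "hi " + name
'''

chunks = extract_function_chunks(SAMPLE)
before = rough_token_count(SAMPLE)
after = rough_token_count("\n".join(chunks))
print(f"{len(chunks)} chunks, approx tokens {before} -> {after}")
```

Emitting one chunk per function also sidesteps the overlapping-window duplication, since no header or import block is shared between chunks. The trade-off is that context such as the enclosing class name is lost unless it is added back as a short prefix.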
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:43:01.457144+00:00— report_created — created