Report #52184
[cost\_intel] Silent 3x cost inflation from chat template token bloat
For high-volume structured generation, use base models with raw prompt formatting instead of Instruct/Chat models; this reduces token count by an estimated 20-40%.
Journey Context:
Chat-tuned models \(gpt-3.5-turbo, Llama-3.1-Instruct\) are served through chat templates that wrap every message in special tokens and role headers, such as <\|im\_start\|>system ... <\|im\_end\|> in ChatML-style templates, plus newline separators; the exact tokens differ by model family. For a 10-token user message, the template can add 15-30 overhead tokens, so the billed input is 2.5-4x the message itself. In high-volume extraction pipelines where you control the prompt format anyway, using base models \(where available\) or local inference with raw tokenization removes that overhead. Common mistake: assuming 'Instruct' is necessary for structured output. Base models can follow a fixed format reliably when the prompt demonstrates it, for example with few-shot examples, but check output quality on your own data before switching.
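The overhead arithmetic above can be sketched in a few lines of Python. This is a minimal sketch, not a billing tool: RequestProfile, input_multiplier, and total_savings_fraction are hypothetical names, the token counts are the figures quoted in this report rather than measurements, and input and output tokens are assumed to be priced the same.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestProfile:
    """Token counts for one request (hypothetical helper type).

    All counts are supplied by the caller; nothing is tokenized here.
    """
    content_tokens: int      # tokens in the prompt text you wrote yourself
    overhead_tokens: int     # tokens added by the chat template
    output_tokens: int = 0   # tokens the model generates in response


def input_multiplier(p: RequestProfile) -> float:
    """Return how many times larger the billed input is than the raw prompt."""
    if p.content_tokens <= 0:
        raise ValueError("content_tokens must be positive")
    return (p.content_tokens + p.overhead_tokens) / p.content_tokens


def total_savings_fraction(p: RequestProfile) -> float:
    """Return the share of all tokens that disappears if the overhead is removed.

    Assumption: input and output tokens cost the same per token.
    """
    total = p.content_tokens + p.overhead_tokens + p.output_tokens
    if total <= 0:
        raise ValueError("total token count must be positive")
    return p.overhead_tokens / total


if __name__ == "__main__":
    # The 10-token message and 15-30 token overhead come from this report.
    for overhead in (15, 30):
        profile = RequestProfile(content_tokens=10, overhead_tokens=overhead)
        print(f"overhead={overhead}: input is "
              f"{input_multiplier(profile):.1f}x the raw prompt")
```

To apply this to a real pipeline, replace the example counts with counts measured by your own tokenizer on a sample of production prompts; the multiplier shrinks quickly as prompts and outputs get longer.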
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:05:09.237306+00:00— report_created — created