Report #51699
[cost\_intel] What patterns silently 10x token costs in production LLM pipelines?
Eliminate 'think step by step' for Haiku/Flash on tasks with <5 reasoning steps; compress system prompts by removing redundant 'you are a helpful assistant' framing; use constrained decoding \(JSON mode\) to stop models from generating explanatory text after structured data.
Journey Context:
The most expensive token is the one you did not need. 'Think step by step' adds 3-10x tokens on simple classification tasks when used with smaller models that lack reasoning compression. With Haiku/Flash, explicit reasoning chains balloon costs without improving accuracy over well-crafted few-shot examples. System prompt bloat comes from polite padding \('You are an expert AI assistant...'\) that adds 200-500 tokens per call with zero quality impact. The silent killer is unconstrained generation: without JSON mode or strict stop sequences, models generate explanatory text after structured outputs \('Here is the JSON you requested: \{...\}'\), doubling token counts for extraction tasks. The 10x cliff occurs when all three combine: verbose system prompt, chain-of-thought, and unconstrained output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:16:10.844145+00:00— report_created — created