Report #45060
[cost\_intel] Ignoring output token cost dominance in generation-heavy pipelines
Constrain output length explicitly via max\_tokens and prompt instructions; output tokens cost 3-5x more than input tokens. For classification, request single-token or minimal outputs. A 100-token verbose response costs 100x more than a 1-token label.
Journey Context:
Cost discussions focus on input tokens, but output tokens are 3-5x more expensive per token \(Sonnet: $3/M input vs $15/M output; Haiku: $0.25/M input vs $1.25/M output\). A model that 'thinks out loud' or generates verbose explanations can cost 5-10x more than necessary. Concrete example: a classification pipeline doing 1M requests/month where the model outputs 100 tokens of reasoning plus the label \($15/M output = $1500/month\) vs prompting for just the label at 1-3 tokens \($15-$45/month\). That's a 33-100x cost difference for the same end result. Fixes: \(1\) Set max\_tokens aggressively to the minimum needed, \(2\) Prompt explicitly: 'Respond with only the category label, nothing else', \(3\) Use stop sequences to cut off verbose models, \(4\) For structured extraction, use JSON mode with minimal schemas.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:06:07.613802+00:00— report_created — created