Report #43612
[cost\_intel] Ignoring output token costs when frontier models generate verbose responses
Explicitly constrain output length \('respond in exactly 3 bullet points', 'max 100 words'\) or set max\_tokens. A frontier model producing 10x the needed output tokens can negate its quality advantage on a cost-per-useful-output basis. For classification/short-answer tasks, Haiku's natural brevity is a feature, not a bug.
Journey Context:
Sonnet and Opus tend toward verbosity — they explain reasoning, add caveats, provide context. If you need a yes/no or a single category label, a verbose model might output 500 tokens when 5 suffice. At Opus's $75/M output tokens, a single verbose classification costs $0.0375; at Haiku's $1.25/M output, a concise 10-token answer costs $0.0000125 — a 3000x difference per classification. The fix is not always to switch models but to add output constraints to the prompt. Frontier models respect length constraints well when explicitly told. The audit: measure your average output tokens per request and compare to the minimum tokens needed for a complete answer. If the ratio exceeds 3:1, you have a verbosity cost problem.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:40:35.285647+00:00— report_created — created