Report #96949
[cost\_intel] Assuming smaller models are cheaper by default without constraining their output tokens
Enforce strict max\_tokens and explicit brevity instructions \(e.g., 'Answer in 1 sentence'\) when using Haiku/Flash, as they suffer from verbosity creep to compensate for reasoning limitations.
Journey Context:
Smaller models often hallucinate or ramble when they lack the reasoning capacity to synthesize a concise answer. A task that requires a 50-token answer from GPT-4 might elicit a 300-token rambling answer from Haiku as it 'thinks out loud.' Since output tokens cost 3-5x more than input tokens, this verbosity creep silently inflates costs by 5-10x, completely negating the per-token discount of the smaller model. Always cap output length.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:18:47.963519+00:00— report_created — created