Report #50666
[cost_intel] Model generates off-track output: paying for full completion before detecting failure
Use streaming with early-termination checks: inspect the first 100-200 tokens of output and cancel the generation if it is clearly off-track (wrong format, hallucinated preamble, wrong language). This can save 50-80% of output-token costs on failed generations, assuming the provider stops generating (and billing) once the stream is cancelled.
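How much is saved depends on where the check fires relative to how long the failed output would otherwise have run. A minimal sketch of that arithmetic, assuming billing stops at the cancel point and ignoring input-token cost; the function name and the sample output lengths are illustrative, not taken from any provider's pricing:

```python
def saved_fraction(cutoff_tokens: int, full_output_tokens: int) -> float:
    """Share of a failed generation's output tokens that is never generated."""
    if full_output_tokens <= 0:
        raise ValueError("full_output_tokens must be positive")
    # A cutoff past the natural end of the output saves nothing.
    cutoff = min(cutoff_tokens, full_output_tokens)
    return 1 - cutoff / full_output_tokens


# Illustrative lengths: cancelling at 200 tokens against failed outputs
# that would have run to 400, 1000 and 2000 tokens.
for full in (400, 1000, 2000):
    print(f"{full} tokens: {saved_fraction(200, full):.0%} saved")
```

With a 200-token cutoff, the 50-80% range corresponds to failed outputs that would have run to roughly 400-1000 tokens; longer failures save proportionally more.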
Journey Context:
Without streaming, a model that starts hallucinating or producing irrelevant content keeps generating until it stops on its own or hits max_tokens, and you pay for all of it before you can intervene. A code-generation request where the model writes 3 paragraphs of explanation before any code bills you for those paragraphs even if you throw them away.
With streaming, you can inspect early tokens and abort: if the model starts with 'Sure, I can help you with that!' instead of a code fence, cancel immediately. This is especially impactful in automated pipelines with retry logic, because you avoid paying for the full bad generation and also reduce time-to-retry.
Implementation: stream the response, buffer the first N tokens, check the buffer against heuristics (starts with a code fence, contains expected keywords, correct format), and close the connection if the check fails. For a pipeline with a 20% failure rate and bad outputs averaging 2000 tokens, terminating at 200 tokens saves (2000 - 200) / 2000 = 90% of the output-token cost of each failure, or about 0.2 × 1800 = 360 output tokens per request averaged over the whole pipeline. Input tokens are still billed in full.
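The buffer, check, abort loop described above can be sketched without tying it to a particular SDK. The version below assumes the response arrives as an iterable of text chunks and uses character count as a rough stand-in for tokens. `guarded_stream`, `EarlyAbort`, `starts_with_code_fence`, and the `cancel` hook are hypothetical names for illustration; in a real client, `cancel` would be whatever call closes the streaming response.

```python
from typing import Callable, Iterable, Iterator, List, Optional

FENCE = "`" * 3  # a Markdown code fence


class EarlyAbort(Exception):
    """Raised when the buffered prefix fails the on-track check."""


def starts_with_code_fence(prefix: str) -> bool:
    """Example heuristic: output must open with a code fence."""
    return prefix.lstrip().startswith(FENCE)


def guarded_stream(
    chunks: Iterable[str],
    looks_on_track: Callable[[str], bool],
    probe_chars: int = 800,  # assumption: characters approximate tokens
    cancel: Optional[Callable[[], None]] = None,
) -> Iterator[str]:
    """Yield streamed text, aborting if the first probe_chars look off-track."""
    buffered: List[str] = []
    size = 0
    checked = False
    for chunk in chunks:
        if checked:
            # Prefix already passed the check: pass chunks straight through.
            yield chunk
            continue
        buffered.append(chunk)
        size += len(chunk)
        if size >= probe_chars:
            prefix = "".join(buffered)
            if not looks_on_track(prefix):
                if cancel is not None:
                    cancel()  # close the connection so generation stops
                raise EarlyAbort(prefix)
            checked = True
            yield prefix
    if not checked:
        # Stream ended before the probe window filled: check what arrived.
        prefix = "".join(buffered)
        if not looks_on_track(prefix):
            raise EarlyAbort(prefix)
        if prefix:
            yield prefix
```

A caller would wrap the SDK's chunk iterator in `guarded_stream(...)`, catch `EarlyAbort`, and retry. Because the guard stops reading from the source as soon as the check fails, no chunks past the probe window are pulled. Whether billing also stops at that point depends on the provider and should be verified.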
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:31:39.134156+00:00— report_created — created