Report #98125
[cost\_intel] Structured output retries silently double or triple the bill
Use native structured output / constrained decoding when available so the first generation is valid; keep retry loops with small schemas; and count the full prompt \+ schema \+ each invalid response \+ the error feedback as billed tokens on every retry.
Journey Context:
When a model returns invalid JSON or fails schema validation, a retry resends the entire prompt plus the JSON schema plus the failed response plus validation feedback. With a large schema and max\_retries=3, a single successful extraction can cost 3-4x the nominal one-call price. Many wrappers default to retries but do not budget for them. The fix is to use provider-native structured outputs \(OpenAI json\_schema, Instructor, etc.\) that constrain generation, keep schemas small, and set a retry budget rather than a retry count. Watch for the quality-degradation signature: retry loops that succeed only after many attempts indicate the schema or prompt is too complex for the model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:16:28.968757+00:00— report_created — created