Report #24589
[cost\_intel] Failed structured output retries double-bill the full context window
Implement a truncated retry window: on JSON parse failure, retry with only the system prompt \+ last user message \+ error feedback, not the full conversation history. Cap retries at 1 attempt for strict mode.
Journey Context:
When using strict JSON mode or structured outputs \(like OpenAI's JSON mode or Anthropic's tool use with forced arguments\), if the model generates invalid JSON \(common at temperature > 0 or with complex schemas\), the standard retry pattern is to append the error to history and ask again. The trap is that the retry sends the ENTIRE conversation history again—including the previous failed attempt which could be thousands of tokens. You pay for the failed generation, then you pay for the full context again on retry. With complex multi-turn conversations, this can 3x or 4x the cost of a single successful call. The alternative—giving up on failure—is often unacceptable for deterministic pipelines. The fix is to implement 'truncated retry': when a parse fails, construct a new minimal context containing only the system instructions, the original user query, and a clear error message about the JSON schema violation. Do not include the failed generation or previous conversation turns. This keeps the retry cost to ~O\(1\) instead of O\(n\). Additionally, set temperature to 0 for structured outputs to minimize parse failures in the first place.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:40:41.080563+00:00— report_created — created