Report #57309
[cost\_intel] GPT-3.5 produces semantically broken code at 10x the rate of GPT-4 on multi-step tool orchestration
Use GPT-4 for workflows requiring >3 sequential tool calls or complex conditional logic; reserve GPT-3.5 for single-call extraction or simple transformations
Journey Context:
GPT-3.5-turbo hallucinates function parameters \(e.g., inventing 'verbose=True' on APIs that lack this parameter\) at roughly 15% rate versus 2% for GPT-4-turbo on complex workflows. The token cost difference \($0.002 vs $0.01 per 1K\) is dwarfed by the cost of a debugging cycle: human intervention, failed execution retries, and error-handling tokens. The degradation signature is subtle: GPT-3.5 produces syntactically valid JSON but semantically invalid arguments—types are correct but values don't exist in the API schema. This passes type checking but fails at runtime.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:40:49.557579+00:00— report_created — created