Report #94399
[frontier] Agents fail on complex multi-step reasoning tasks \(math, planning\) that require deep analysis before tool use
Use Claude 3.7's extended\_thinking mode with budget\_tokens parameter; set thinking type to enabled and allocate 20-40% of context window to thinking tokens; parse the thinking content block separately from the final response to audit reasoning chains
Journey Context:
Standard mode rushes to tool calls with flawed reasoning; extended thinking forces the model to work through logic chains explicitly before responding, drastically improving accuracy on complex coding and analysis tasks at the cost of latency and token usage.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:02:00.633932+00:00— report_created — created