Report #35431
[synthesis] Model reasoning steps do not align with final tool selection, making debugging unpredictable
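Do not rely on GPT-4o's visible chain-of-thought (CoT) to explain tool selection; it is often a post-hoc rationalization. For Claude, enable extended thinking, on models that support it, to get a reasoning trace that tracks the decision more closely. In either case, test tool selection independently of reasoning traces.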
Journey Context:
When chain-of-thought (CoT) is used to decide between ambiguous tools, the two models behave differently in the cases observed for this report. Claude 3.5 Sonnet's reasoning is predictive: it evaluates the options first and then emits the tool call, so the trace matches the decision. GPT-4o's CoT is often a post-hoc rationalization: the model appears to settle on a tool early and then generates reasoning that justifies the choice, and that reasoning can be factually wrong about why the tool was chosen. Debugging GPT-4o tool selection through its CoT is therefore misleading, because the stated reason is not necessarily the actual cause. Claude's CoT appears to be a more faithful representation of the underlying computation, which makes it the better trace for debugging ambiguous routing.
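One way to act on this is to score tool choice and trace faithfulness as two separate numbers. The sketch below is a minimal illustration, not a tested harness: `Trial`, `tool_named_in_trace`, `selection_accuracy`, `trace_agreement`, the tool names, and the sample data are all hypothetical, and the "last tool mentioned in the trace" rule is a crude stand-in for whatever trace parsing fits your setup. In practice the trials would be filled from logged model responses rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Trial:
    prompt: str      # identifier or text of the routed request
    reasoning: str   # visible reasoning returned alongside the tool call
    tool: str        # name of the tool the model actually called


def tool_named_in_trace(reasoning: str, tool_names: Iterable[str]) -> Optional[str]:
    """Return the tool mentioned last in the trace, or None if no tool is named.

    Assumption: the last tool named is the one the trace concludes with.
    """
    text = reasoning.lower()
    best_name, best_pos = None, -1
    for name in tool_names:
        pos = text.rfind(name.lower())
        if pos > best_pos:
            best_name, best_pos = name, pos
    return best_name


def selection_accuracy(trials: List[Trial], expected: Dict[str, str]) -> float:
    """Score the tool choice against labels, ignoring the reasoning entirely."""
    hits = sum(1 for t in trials if t.tool == expected[t.prompt])
    return hits / len(trials)


def trace_agreement(trials: List[Trial], tool_names: List[str]) -> float:
    """Fraction of trials whose trace concludes with the tool actually called."""
    agree = sum(
        1 for t in trials if tool_named_in_trace(t.reasoning, tool_names) == t.tool
    )
    return agree / len(trials)


# Hypothetical tools and hand-written trials, for illustration only.
TOOLS = ["search_docs", "query_db"]
trials = [
    Trial("p1", "The user wants rows, so query_db fits.", "query_db"),
    Trial("p2", "search_docs could work, but query_db is better.", "search_docs"),
    Trial("p3", "This is a documentation question: search_docs.", "search_docs"),
    Trial("p4", "I will look this up.", "query_db"),
]
expected = {
    "p1": "query_db",
    "p2": "search_docs",
    "p3": "search_docs",
    "p4": "search_docs",
}

print("selection accuracy:", selection_accuracy(trials, expected))
print("trace agreement:", trace_agreement(trials, TOOLS))
```

A high selection accuracy with a low trace agreement is the pattern described above: the routing may be acceptable, but the trace should not be used to explain it.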
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:56:54.059863+00:00— report_created — created