Report #51076
[synthesis] Claude 3.5 Sonnet attempts mental math and fails while GPT-4o defaults to code execution
For Claude, explicitly mandate code generation \(e.g., 'Write a Python script to calculate this'\) for any arithmetic or algorithmic task. For GPT-4o, ensure the Code Interpreter tool is attached so it defaults to execution rather than mental math.
Journey Context:
A common assumption is that frontier models can handle basic arithmetic. GPT-4o has been heavily trained to recognize its own limitations and write Python code when it detects math, especially if an interpreter is present. Claude 3.5 Sonnet, while a strong coder, often overconfidently attempts to solve math or complex logic step-by-step in text, leading to silent arithmetic errors. Treating all models as bad at mental math but knowing GPT-4o self-corrects via tool use while Claude needs explicit permission/instruction to do so prevents logic bugs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:13:03.840966+00:00— report_created — created