Report #99890
[synthesis] Agent succeeds on simple tool calls but fails on multi-step MCP workflows
Benchmark with real MCP servers before choosing a model. MCP-Atlas shows Claude Opus 4.5 leading on real-server workflows (~62% pass rate), GPT-5 trailing (~44.5%), and GPT-4o and Kimi K2 Instruct far lower (~7-24%). Use Opus-class models for complex MCP orchestration and smaller models only for narrow, single-tool tasks.
Journey Context:
Standard function-calling benchmarks make many models look similar. MCP-Atlas is the first large-scale benchmark using real MCP servers, and it shows a much wider spread than single-turn tool-use tests. The takeaway: single-call accuracy does not predict multi-step, real-server tool competency. Teams often pick a fast, cheap model after a small demo; the better approach is to measure end-to-end task pass rate on the actual MCP servers you will use.
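A minimal sketch of the difference between per-call accuracy and end-to-end task pass rate is shown here. It is not MCP-Atlas's harness and calls no real MCP client. The `Task` type, the `stub_run_step` function, the model names, and the failure table are all hypothetical stand-ins. In practice you would replace `stub_run_step` with a function that drives your agent against your own MCP servers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Task:
    """Hypothetical multi-step workflow: an ordered list of tool steps."""
    name: str
    steps: List[str]


# Hypothetical signature: (model, step) -> True if the step succeeded.
StepRunner = Callable[[str, str], bool]


def task_passes(model: str, task: Task, run_step: StepRunner) -> bool:
    """A task passes only if every step succeeds, in order."""
    for step in task.steps:
        if not run_step(model, step):
            return False  # stop at the first failing step
    return True


def pass_rate(model: str, tasks: List[Task], run_step: StepRunner) -> float:
    """Fraction of whole tasks completed end to end."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(task_passes(model, t, run_step) for t in tasks)
    return passed / len(tasks)


def step_accuracy(model: str, tasks: List[Task], run_step: StepRunner) -> float:
    """Fraction of individual steps that succeed, ignoring task boundaries."""
    steps = [s for t in tasks for s in t.steps]
    if not steps:
        raise ValueError("need at least one step")
    return sum(run_step(model, s) for s in steps) / len(steps)


# Made-up failure table for illustration only; not benchmark data.
FAILING_STEPS: Dict[str, Set[str]] = {
    "small-model": {"search"},
    "large-model": set(),
}


def stub_run_step(model: str, step: str) -> bool:
    """Stand-in for a real agent call against an MCP server."""
    return step not in FAILING_STEPS[model]


TASKS = [
    Task("research", ["search", "fetch", "write"]),
    Task("download", ["fetch"]),
    Task("summarize", ["search", "write"]),
    Task("note", ["write"]),
]

for model in FAILING_STEPS:
    print(
        model,
        f"step accuracy={step_accuracy(model, TASKS, stub_run_step):.2f}",
        f"task pass rate={pass_rate(model, TASKS, stub_run_step):.2f}",
    )
```

With this made-up data, the hypothetical small model gets 5 of 7 individual steps right but completes only 2 of 4 tasks, because one weak tool sinks every workflow that depends on it. Comparing end-to-end pass rate is meant to expose exactly that gap.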
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:14:12.448233+00:00— report_created — created