Report #99886
[synthesis] Model calls the wrong tool or hallucinates arguments when a user request is ambiguous
Design per-model guardrails instead of generic tool descriptions. For Claude, allow the model to ask clarifying questions, because it is trained to decline uncertain calls. For GPT-4o/GPT-5, add explicit 'when NOT to use' descriptions and a fallback ask_clarification tool, because these models tend to guess. For Kimi K2, strictly validate chat-template formatting and tool-call ID conventions, because it is more brittle when the message history is malformed.
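A minimal sketch of how the per-model guardrails above might be wired up, assuming tools are plain dicts with "name", "description", and "parameters" keys. The Guardrail fields, the family prefixes ("claude", "gpt", "kimi"), and the prepare_tools helper are hypothetical names chosen for illustration, not part of any provider SDK; only the ask_clarification tool name comes from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    allow_clarifying_questions: bool   # let the model ask instead of calling a tool
    add_when_not_to_use: bool          # append negative guidance to descriptions
    add_clarification_tool: bool       # expose a fallback ask_clarification tool
    strict_history_validation: bool    # validate message history before sending


# Hypothetical family prefixes; the flags mirror the recommendations in this report.
GUARDRAILS = {
    "claude": Guardrail(True, False, False, False),
    "gpt": Guardrail(False, True, True, False),
    "kimi": Guardrail(False, False, False, True),
}

# Fallback tool the model can call instead of guessing arguments.
ASK_CLARIFICATION_TOOL = {
    "name": "ask_clarification",
    "description": "Ask the user one short question when the request is "
                   "ambiguous. Use this instead of guessing arguments.",
    "parameters": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}


def family_of(model_name: str) -> str:
    """Map a model name to a guardrail family by prefix (an assumption)."""
    name = model_name.lower()
    for family in GUARDRAILS:
        if name.startswith(family):
            return family
    raise ValueError(f"no guardrail defined for model {model_name!r}")


def prepare_tools(model_name: str, tools: list, when_not_to_use: dict) -> list:
    """Return a copy of `tools` adjusted for the model family's guardrail."""
    rail = GUARDRAILS[family_of(model_name)]
    prepared = []
    for tool in tools:
        tool = dict(tool)  # shallow copy so the caller's list is untouched
        note = when_not_to_use.get(tool["name"])
        if rail.add_when_not_to_use and note:
            tool["description"] = f'{tool["description"]} Do NOT use when: {note}'
        prepared.append(tool)
    if rail.add_clarification_tool:
        prepared.append(ASK_CLARIFICATION_TOOL)
    return prepared
```

Keeping the policy in one table makes the per-family difference explicit: the same tool list goes in, and only the families that tend to guess get the extra negative guidance and the fallback tool.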
Journey Context:
Most tutorials treat tool choice as provider-agnostic and recommend 'better descriptions' as the universal fix. Reading the ToolSandbox benchmark, provider docs, and production reports side by side suggests that the failure modes are specific to each model family. Claude's tool-use training rewards declining uncertain calls, which is reliable but can look like stubbornness. OpenAI models optimize for task completion and over-trigger tools when intent is fuzzy. Kimi K2 is highly capable but sensitive to prompt-template details. The wrong fix is more verbose descriptions everywhere; the right fix is to match the guardrail to each model's failure mode.
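For the template-sensitive case, a pre-flight check on the message history can catch malformed tool-call bookkeeping before a request is sent. The sketch below assumes OpenAI-style chat messages (assistant messages carrying a "tool_calls" list with "id" fields, and "tool" messages carrying "tool_call_id"); validate_tool_history is a hypothetical helper, and it checks only ID pairing, not any model-specific ID naming scheme or the chat template itself.

```python
def validate_tool_history(messages: list) -> list:
    """Return a list of problems found in tool-call ID bookkeeping.

    Assumes OpenAI-style message dicts. An empty list means every tool
    call has a unique, non-empty id and exactly one matching tool result.
    """
    errors = []
    seen = set()      # every tool-call id issued so far
    pending = set()   # ids still waiting for a tool result
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "assistant":
            for call in msg.get("tool_calls") or []:
                call_id = call.get("id")
                if not call_id:
                    errors.append(f"message {i}: tool call without an id")
                elif call_id in seen:
                    errors.append(f"message {i}: duplicate tool call id {call_id!r}")
                else:
                    seen.add(call_id)
                    pending.add(call_id)
        elif role == "tool":
            call_id = msg.get("tool_call_id")
            if call_id in pending:
                pending.discard(call_id)
            else:
                errors.append(f"message {i}: tool result for unknown id {call_id!r}")
    for call_id in sorted(pending):
        errors.append(f"unanswered tool call {call_id!r}")
    return errors
```

Running this before each request and refusing to send when the list is non-empty turns a silent formatting problem into an explicit error that can be logged and fixed.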
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:14:01.956320+00:00 - report_created - created