Report #82862
[frontier] Agent consistently misuses or underuses available tools despite them being correctly registered
Treat tool descriptions as first-class prompt engineering artifacts. Write descriptions that include: when to use the tool, when NOT to use it, expected input formats with concrete examples, common mistakes, and relationships to other tools. A/B test tool descriptions and track tool misuse rates as a reliability metric.
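To make the checklist above concrete, here is a minimal sketch of one tool definition whose description covers each element, plus a small lint helper. The `name` / `description` / `input_schema` layout follows the common JSON Schema style of tool definition. The tool names (`search_web`, `search_docs`, `fetch_page`), the section markers, and the `missing_sections` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical tool definition; names, fields, and wording are illustrative only.
SEARCH_WEB_TOOL = {
    "name": "search_web",
    "description": (
        "Search the public web and return top results as title, URL, and snippet.\n"
        "Use when: the user asks about recent events or facts that may have changed.\n"
        "Do NOT use when: the answer is already in the conversation or in an internal "
        "document; use search_docs for internal content.\n"
        "Input: `query` is a short keyword string, e.g. \"python 3.13 release date\". "
        "`max_results` is an integer from 1 to 10.\n"
        "Common mistakes: passing a full URL as the query, or a whole paragraph of text.\n"
        "Related tools: call fetch_page on a returned URL to read the full page."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords, not a URL."},
            "max_results": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}

# Section markers this sketch expects in every description (an assumed convention).
REQUIRED_MARKERS = (
    "Use when:",
    "Do NOT use when:",
    "Input:",
    "Common mistakes:",
    "Related tools:",
)


def missing_sections(tool: dict) -> list[str]:
    """Return the required markers that the tool's description lacks."""
    description = tool.get("description", "")
    return [marker for marker in REQUIRED_MARKERS if marker not in description]
```

A check like `missing_sections` could run in CI so that a one-line description such as 'Searches the web' fails review before it reaches an agent. The marker convention is a design choice for this sketch, not a requirement of any API.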
Journey Context:
Most developers write minimal tool descriptions ('Searches the web') and then wonder why agents pick the wrong tool or pass invalid arguments. The LLM sees the description, not the implementation: if the description doesn't convey constraints, edge cases, and usage patterns, the model guesses. Anthropic's tool use docs explicitly recommend detailed descriptions with examples, and production teams report that tool description quality correlates more strongly with agent reliability than model size or system-prompt engineering does. The emerging practice is to version tool descriptions alongside the code, test them with the same rigor, and treat description edits as a lever for improving agent behavior without changing the model or the system prompt. The tradeoff is token consumption: longer descriptions eat into the context budget. Optimize for information density, so that every token in the description reduces ambiguity about when and how to use the tool.
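The versioning, misuse-rate tracking, and token-budget points above can be sketched as follows. This is one possible shape under stated assumptions, not an established tool: the `ToolCall` record, the variant labels, and the rule of thumb of roughly 4 characters per token are all assumptions made for illustration, and the token estimate is no substitute for the model's real tokenizer.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One logged agent tool call (hypothetical record layout)."""
    variant: str        # which description variant was live, e.g. "A" or "B"
    expected_tool: str  # tool an eval label or reviewer says should have been used
    chosen_tool: str    # tool the agent actually called
    args_valid: bool    # whether the arguments passed schema validation


def description_version(description: str) -> str:
    """Short content hash, so logs can record which wording was in use."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()[:12]


def misuse_rate(calls: list[ToolCall], variant: str) -> float:
    """Share of calls under one variant with a wrong tool or invalid arguments."""
    subset = [c for c in calls if c.variant == variant]
    if not subset:
        return 0.0
    misused = sum(
        1 for c in subset
        if c.chosen_tool != c.expected_tool or not c.args_valid
    )
    return misused / len(subset)


def rough_token_estimate(description: str) -> int:
    """Crude budget check assuming about 4 characters per token (an assumption)."""
    return max(1, len(description) // 4)
```

In use, each description variant would be logged under its `description_version` hash, and `misuse_rate` compared across variants on the same labeled evaluation set. A real comparison needs enough calls per variant for the difference to be meaningful, and the longer variant's `rough_token_estimate` shows what the improvement costs in context budget.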
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:40:32.958513+00:00— report_created — created