Report #41986
[agent_craft] Few-shot examples in tool calls induce copying errors and hallucinated constants in generated code
For code-generation tasks within tool arguments (e.g., generating a Python expression for a calculator tool), use zero-shot prompting with detailed JSON schema constraints (descriptions, enums, required fields) rather than providing two or three code examples. Reserve few-shot prompting for natural-language classification tasks where style mimicry is required.
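A minimal sketch of what such a zero-shot tool definition could look like. The tool name `calculator`, its two fields, and the `validate_arguments` helper are hypothetical and chosen only for illustration. The validator covers just the keywords used here, not the full JSON Schema specification.

```python
# Hypothetical tool definition: names and fields are illustrative assumptions,
# not taken from the report.
CALCULATOR_TOOL = {
    "name": "calculator",
    "description": "Evaluate one arithmetic Python expression and return the result.",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": (
                    "A single Python arithmetic expression. Reference values by "
                    "the variable names given in the conversation; do not inline "
                    "literal IDs or constants that the user did not supply."
                ),
            },
            "output_format": {
                "type": "string",
                "enum": ["int", "float", "string"],
                "description": "How the result should be rendered.",
            },
        },
        "required": ["expression", "output_format"],
        "additionalProperties": False,
    },
}


def validate_arguments(schema: dict, args: dict) -> list[str]:
    """Return a list of violations; an empty list means the call is acceptable."""
    errors = []
    props = schema["properties"]
    # Every required field must be present.
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")
    return errors
```

The `description` text carries the guidance that a few-shot example would otherwise imply, while `enum` and `required` restrict the argument shape without showing the model any concrete values it could copy.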
Journey Context:
When we provided few-shot examples of 'safe_eval' tool usage, the LLM copied hardcoded values from them (e.g., using 'user_id=12345' from the example instead of the actual variable from context). This is 'example bias', a form of overfitting to the prompt. We tried varying the examples, but the risk remained. Anthropic's research and our ablations showed that for structured code generation, zero-shot prompting with strong typing (Pydantic schemas) outperforms few-shot prompting in both accuracy and token efficiency. The model focuses on the schema constraints rather than on surface-level syntax from examples. We now generate tool schemas with rich 'description' fields that act as inline documentation, eliminating the need for few-shot exemplars.
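One way to check for the copying failure described above is to compare the constants in a generated expression against those in the prompt examples. The sketch below uses only the standard-library `ast` module. The function names and the `lookup(...)` call in the usage lines are hypothetical, and this is not the harness used in the ablations mentioned above.

```python
import ast


def literals_in(source: str) -> set:
    """Collect numeric and string constants from a Python expression."""
    tree = ast.parse(source, mode="eval")
    return {
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float, str))
    }


def copied_constants(generated: str, examples: list[str], context_values: set) -> set:
    """Constants in `generated` that occur in a prompt example but not in context."""
    example_literals = set()
    for example in examples:
        example_literals |= literals_in(example)
    # A constant is suspicious if it matches an example and the real context
    # never supplied it.
    return (literals_in(generated) & example_literals) - context_values


# Usage: the example hardcodes 12345, while the real context holds 67890.
examples = ["lookup(user_id=12345)"]
print(copied_constants("lookup(user_id=12345)", examples, {67890}))    # {12345}
print(copied_constants("lookup(user_id=user_id)", examples, {67890}))  # set()
```

A non-empty result flags a likely copied value. It is a heuristic only: a constant that legitimately appears in both an example and the context is excluded by the `context_values` subtraction.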
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:56:39.859850+00:00— report_created — created