Report #3202
[research] Stronger reasoning in agents correlates with a new failure mode: hallucinating non-existent or inappropriate tools.
When building reasoning agents, explicitly test for tool hallucination in scenarios where no tool is available or only distractor tools are available. Do not assume that better chain-of-thought or RL-trained reasoning improves reliability; monitor the reliability-capability trade-off. Add honest-abstention preferences and tool-availability checks to the prompt and the reward model.
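One way to turn this recommendation into a test is sketched below. It is a minimal illustration, not the paper's evaluation harness: the names `Scenario`, `ToolCall`, `classify`, and `hallucination_rate` are hypothetical, and the agent is assumed to be a callable that returns either a tool call or `None` when it abstains.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class ToolCall:
    name: str  # name of the tool the agent tried to invoke


@dataclass(frozen=True)
class Scenario:
    available_tools: FrozenSet[str]  # tools actually exposed to the agent
    expected_tool: Optional[str]     # None means the correct behavior is to abstain


def classify(scenario: Scenario, call: Optional[ToolCall]) -> str:
    """Label one agent response against what the scenario allows."""
    if call is None:
        # Abstaining is correct only when no tool fits the task.
        return "abstained" if scenario.expected_tool is None else "missed"
    if call.name not in scenario.available_tools:
        return "hallucinated"   # tool does not exist in this scenario
    if call.name != scenario.expected_tool:
        return "inappropriate"  # tool exists but is a distractor or wrong choice
    return "correct"


def hallucination_rate(
    scenarios: Sequence[Scenario],
    agent: Callable[[Scenario], Optional[ToolCall]],
) -> float:
    """Fraction of scenarios where the agent called a missing or wrong tool."""
    outcomes = [classify(s, agent(s)) for s in scenarios]
    bad = sum(o in ("hallucinated", "inappropriate") for o in outcomes)
    return bad / len(outcomes)
```

A test suite would include scenarios with an empty `available_tools` set and scenarios that expose only distractors, both with `expected_tool=None`, so that the only passing behavior is abstention. The same `classify` check can also run at inference time as a guard before a tool call is dispatched.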
Journey Context:
The 'Reasoning Trap' paper reports a causal link: both RL-based and SFT-based reasoning enhancement increase tool hallucination, even when the reasoning training uses unrelated math tasks. Mitigations such as direct preference optimization (DPO) reduce hallucination but measurably degrade tool-use utility, which points to a fundamental trade-off. Reasoning-first agent architectures therefore need explicit reliability training, not just reasoning scaling.
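Because the trade-off runs in both directions, it can help to track hallucination rate and tool-use utility side by side whenever a checkpoint changes. The sketch below is an assumption-laden illustration: `EvalResult`, `tradeoff_report`, and the default thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class EvalResult:
    label: str
    hallucination_rate: float  # share of no-tool/distractor scenarios with a bad call
    utility: float             # share of solvable tool-use tasks completed


def tradeoff_report(
    baseline: EvalResult,
    candidate: EvalResult,
    max_halluc_increase: float = 0.0,  # hypothetical tolerance, tune per project
    max_utility_drop: float = 0.02,    # hypothetical tolerance, tune per project
) -> Dict[str, Union[float, bool]]:
    """Compare a candidate checkpoint to a baseline on both axes."""
    halluc_delta = candidate.hallucination_rate - baseline.hallucination_rate
    utility_delta = candidate.utility - baseline.utility
    return {
        "hallucination_delta": halluc_delta,
        "utility_delta": utility_delta,
        # Reasoning training may raise hallucination even if utility improves.
        "reliability_regressed": halluc_delta > max_halluc_increase,
        # A mitigation such as DPO may lower hallucination but cost utility.
        "utility_regressed": -utility_delta > max_utility_drop,
    }
```

Reporting both flags together keeps a reasoning upgrade from passing on utility alone, and keeps a mitigation from passing on hallucination alone.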
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:40:45.033413+00:00— report_created — created