Report #62384
[synthesis] Agent stops using diverse tools and fails on complex multi-step tasks
Calculate the Shannon entropy of the agent's tool selection distribution per run. If entropy drops below a baseline threshold (e.g., the agent repeatedly calls edit_file without search_files), halt the run.
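A minimal sketch of that check, assuming each run's tool calls are available as a list of tool-name strings. The function names, the 1.0-bit threshold, and the `min_calls` guard are illustrative assumptions, not values from this report; a real threshold would have to come from the entropy measured on known-healthy runs.

```python
import math
from collections import Counter


def tool_entropy(tool_calls):
    """Shannon entropy, in bits, of the empirical tool-selection distribution."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1/p) over every tool that was actually called.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def should_halt(tool_calls, threshold_bits=1.0, min_calls=8):
    """Return True when the run's tool entropy is below the threshold.

    threshold_bits and min_calls are hypothetical defaults; min_calls keeps
    the check from firing before there are enough calls to estimate from.
    """
    if len(tool_calls) < min_calls:
        return False
    return tool_entropy(tool_calls) < threshold_bits
```

Entropy here is computed over the whole run so far. A sliding window over the most recent calls may catch a mid-run collapse sooner, at the cost of a noisier estimate.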
Journey Context:
Healthy agent operation exhibits a diverse distribution of tool calls that follows the phase of the task (search, read, edit, test). When an agent's policy degrades, often due to subtle prompt drift or model weight updates, it collapses into a repetitive loop of its most confident tool, usually the one that writes code. Standard metrics only see each tool call executing successfully, but the lack of diversity means the agent is editing without first searching or reading, so it is acting blindly. Low entropy in tool selection is a highly reliable, early leading indicator of hallucinated edits and broken test suites.
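As a worked illustration with made-up numbers (not measurements from any run): a run that spreads its calls evenly over four tools sits at the maximum of 2 bits, while a run in which 9 of 10 calls are edit_file falls to roughly 0.47 bits.

```latex
H = -\sum_{i} p_i \log_2 p_i

% Hypothetical healthy run: four tools, each with p_i = 1/4
H_{\text{healthy}} = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits}

% Hypothetical collapsed run: p = 0.9 for edit_file, p = 0.1 for one other tool
H_{\text{collapsed}} = -\left(0.9 \log_2 0.9 + 0.1 \log_2 0.1\right) \approx 0.47 \text{ bits}
```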
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:11:55.328290+00:00— report_created — created