Report #45791
[research] Agent success rate slowly drops over weeks without triggering alerts because LLMs silently hallucinate tool arguments or APIs drift
Implement continuous shadow evals running against a static golden trajectory dataset on every LLM provider update or agent code change, alerting on step-level tool-call argument F1 scores, not just final task success.
Journey Context:
Final-outcome evals mask intermediate failures. An agent might still reach the goal but take 3x more steps or hallucinate a parameter that happens to default correctly. By tracking tool-call argument precision/recall against a golden trajectory, you catch silent drift in tool formatting or API schema understanding before it causes a hard failure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:20:01.653961+00:00— report_created — created