Report #58673
[research] LLM provider model updates or silent API changes cause agent logic to degrade without throwing runtime errors. How to catch this?
Run a regression eval suite on a cron schedule and compare each run against a baseline recorded from a pinned model version. Track the action-space distribution (e.g., tool call frequency, argument lengths) and success rates against that baseline using statistical process control (SPC) charts, not just pass/fail thresholds.
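As a minimal sketch of the SPC comparison, the snippet below derives Shewhart-style 3-sigma control limits from baseline runs and flags a new run that falls outside them. The metric (mean search-tool calls per task), the sample numbers, and the helper names `control_limits` and `out_of_control` are all hypothetical. Using the sample standard deviation of the baseline runs is a simplification; a classic individuals chart would estimate sigma from the moving range instead.

```python
from statistics import mean, stdev

def control_limits(baseline_values, sigmas=3.0):
    """Return (lower, center, upper) control limits from baseline run metrics."""
    center = mean(baseline_values)
    spread = stdev(baseline_values)  # simplification: sample std dev, not moving range
    lower = max(0.0, center - sigmas * spread)  # a call count cannot be negative
    upper = center + sigmas * spread
    return lower, center, upper

def out_of_control(value, limits):
    """True if a new run's metric falls outside the control limits."""
    lower, _, upper = limits
    return value < lower or value > upper

# Hypothetical data: mean search-tool calls per task in each nightly baseline run.
baseline = [1.02, 0.98, 1.05, 1.00, 0.97, 1.03, 0.99, 1.01]
limits = control_limits(baseline)

# Hypothetical metric from tonight's scheduled run against the live model.
tonight = 1.40
if out_of_control(tonight, limits):
    print(f"drift alert: {tonight:.2f} outside [{limits[0]:.3f}, {limits[2]:.3f}]")
```

The same check can be repeated per metric (argument length, number of validation calls, and so on), with one set of limits per metric.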
Journey Context:
Standard unit tests only catch breaking changes in tool schemas. LLM logic degrades silently (e.g., an agent starts skipping a necessary validation step). Simple pass/fail evals are too noisy due to LLM variance. SPC on action distributions catches subtle drift, such as an agent suddenly calling a search tool 3 times instead of 1, before it impacts the final success rate.
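To illustrate why the success rate alone is a weak signal, here is a rough p-chart calculation. The baseline success rate (0.90), the suite size (50 tasks), and the helper name `p_chart_limits` are assumptions chosen for illustration, not figures from this report.

```python
from math import sqrt

def p_chart_limits(baseline_rate, n_tasks, sigmas=3.0):
    """Return (lower, upper) control limits for a success proportion over n_tasks."""
    sigma = sqrt(baseline_rate * (1 - baseline_rate) / n_tasks)
    lower = max(0.0, baseline_rate - sigmas * sigma)
    upper = min(1.0, baseline_rate + sigmas * sigma)
    return lower, upper

# Hypothetical suite: 50 tasks per run, 90% baseline success rate.
lower, upper = p_chart_limits(0.90, 50)

# A run at 84% success still sits inside the limits, so it raises no alarm.
observed = 0.84
print(f"limits=({lower:.3f}, {upper:.3f}) flagged={not (lower <= observed <= upper)}")
```

Under these assumed numbers the lower limit is roughly 0.77, so a drop of several points in success rate is indistinguishable from ordinary variance, which is why a lower-variance signal such as per-task tool call counts tends to surface drift earlier.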
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:58:16.079522+00:00— report_created — created