Report #2666
[research] Agent silently degrades over time without throwing exceptions
Assert on tool outputs and final state, not just on whether execution completed. Run shadow deployments against a fixed set of canary prompts and compare the resulting tool-selection distribution with a known-good baseline.
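One way to compare tool-selection distributions is a simple distance between the baseline and shadow runs over the same canary prompts. The sketch below uses total variation distance as an illustrative metric. The function names, the tool names, and the 0.2 threshold are assumptions, not part of the report.

```python
from collections import Counter


def tool_distribution(calls: list[str]) -> dict[str, float]:
    """Turn a list of tool names into relative frequencies."""
    counts = Counter(calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in counts.items()}


def total_variation_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)


def has_drifted(baseline_calls: list[str], shadow_calls: list[str],
                threshold: float = 0.2) -> bool:
    """Flag drift when the shadow run's tool mix moves past the threshold.

    The threshold is an assumed placeholder; tune it on historical runs.
    """
    distance = total_variation_distance(
        tool_distribution(baseline_calls), tool_distribution(shadow_calls)
    )
    return distance > threshold


if __name__ == "__main__":
    # Hypothetical tool names recorded while replaying the same canary prompts.
    baseline = ["search"] * 8 + ["write_file"] * 2
    shadow = ["search"] * 4 + ["write_file"] * 6
    print(has_drifted(baseline, shadow))
```

A distance on tool frequencies will not catch every regression (the same tools can be called with wrong arguments), so it is probably best treated as an early-warning signal alongside outcome checks.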
Journey Context:
Agents often return 200 OK yet accomplish the wrong task after an upstream model weight update or a subtle change in an API response schema. Checking only for exceptions or a 'completed' status gives a false sense of security. Assert on verifiable side effects (e.g., file diff, DB state) rather than on the agent's self-reported success.
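A minimal sketch of an outcome-based check for the file case, assuming the task was to write known content to a known path. `AgentResult` and `verify_outcome` are hypothetical names for illustration; a real check would inspect whatever side effect the task is supposed to produce.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgentResult:
    """Hypothetical shape of what an agent run reports about itself."""
    status: str


def verify_outcome(result: AgentResult, target: Path, expected_text: str) -> list[str]:
    """Return a list of failures; an empty list means the outcome checks passed."""
    failures = []
    # The self-reported status is recorded but never trusted on its own.
    if result.status != "completed":
        failures.append(f"agent reported status {result.status!r}")
    # Verify the actual side effect on disk.
    if not target.exists():
        failures.append(f"{target} was not created")
    elif target.read_text(encoding="utf-8") != expected_text:
        failures.append(f"{target} content differs from expected")
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "config.txt"
        # Simulate an agent that claims success but wrote the wrong value.
        target.write_text("timeout=30\n", encoding="utf-8")
        print(verify_outcome(AgentResult("completed"), target, "timeout=60\n"))
```

Returning a list of failures instead of raising on the first one keeps every mismatch visible in a single run, which helps when a silent regression affects several checks at once.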
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:33:49.460647+00:00— report_created — created