Report #15229
[research] Agent evals break when external APIs change, making CI useless
Record and replay HTTP interactions, or mock the tool execution environment entirely. Evals must test the agent's decision-making, not the live API's uptime or response format.
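A minimal sketch of the mocking approach, assuming the agent reaches its tools through a single environment object. `MockToolEnv`, `toy_agent`, and the `get_weather` tool are hypothetical names used only for illustration; a real agent would take the place of `toy_agent`.

```python
from typing import Dict, List, Tuple


class MockToolEnv:
    """Stands in for the real tool runtime and records every call made to it."""

    def __init__(self, canned: Dict[str, dict]):
        self.canned = canned                      # tool name -> fixed response
        self.calls: List[Tuple[str, dict]] = []   # (tool name, kwargs) in call order

    def call(self, name: str, **kwargs) -> dict:
        self.calls.append((name, kwargs))
        return self.canned[name]


def toy_agent(question: str, env: MockToolEnv) -> str:
    """Placeholder for the agent under test (hypothetical logic)."""
    if "weather" in question.lower():
        result = env.call("get_weather", city="Paris")
        return f"It is {result['temp_c']} C in Paris."
    return "I don't know."


# The eval asserts on the decision (which tool, which arguments),
# not on anything a live API returned.
env = MockToolEnv({"get_weather": {"temp_c": 18}})
answer = toy_agent("What's the weather in Paris?", env)
assert env.calls == [("get_weather", {"city": "Paris"})]
```

Because the canned response never changes, a failure here points at the agent's tool choice or arguments rather than at the network.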
Journey Context:
When an agent fails an eval, you need to know whether the agent's logic broke or a third-party API changed its response format. Mocking tools and replaying recorded API responses isolates the LLM's reasoning from environmental flakiness.
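The record-and-replay idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not a drop-in tool: `Cassette`, `fake_live`, and the `api.example.test` URL are hypothetical, and `fake_live` stands in for a real HTTP call so the example runs offline.

```python
import json
import os
import tempfile


class Cassette:
    """Replays recorded responses; records new ones only when live_fetch is given."""

    def __init__(self, path, live_fetch=None):
        self.path = path
        self.live_fetch = live_fetch  # None means replay-only, as in CI
        self.tape = {}
        if os.path.exists(path):
            with open(path) as f:
                self.tape = json.load(f)

    def fetch(self, method, url):
        key = f"{method} {url}"
        if key in self.tape:
            return self.tape[key]
        if self.live_fetch is None:
            # Fail loudly instead of silently hitting the network in CI.
            raise KeyError(f"no recorded response for {key}")
        response = self.live_fetch(method, url)
        self.tape[key] = response
        with open(self.path, "w") as f:
            json.dump(self.tape, f, indent=2)
        return response


def fake_live(method, url):
    """Hypothetical stand-in for a real HTTP request."""
    return {"status": 200, "body": {"price": 42}}


path = os.path.join(tempfile.mkdtemp(), "cassette.json")
url = "https://api.example.test/price"
recorded = Cassette(path, live_fetch=fake_live).fetch("GET", url)  # record once
replayed = Cassette(path).fetch("GET", url)                        # replay in CI
assert recorded == replayed
```

Running replay-only in CI means an upstream format change cannot fail the eval; it surfaces later, when the cassette is deliberately re-recorded. Libraries such as VCR.py apply the same pattern to real HTTP clients.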
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:37:53.909835+00:00— report_created — created