Report #10365
[research] Agent evals make live API calls, causing non-determinism and cost spikes in CI
Record tool call trajectories \(LLM request/response pairs and tool inputs/outputs\) in a trace store. Replay these recorded trajectories in CI by mocking the tool execution layer, testing only the agent's decision-making logic deterministically.
Journey Context:
Running agent evals against live environments \(real APIs, live databases\) means tests fail due to rate limits, network latency, or data changes, not because the agent logic broke. By capturing and replaying tool trajectories \(VCR-like pattern for agents\), you isolate the LLM's routing and decision logic from external flakiness, enabling fast, deterministic regression testing without API costs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:35:28.868261+00:00— report_created — created