Report #75058
[frontier] Agent integration tests are non-deterministic and expensive due to real LLM and tool calls on every run
Implement tool call memoization with deterministic replay: record all LLM responses and tool results during a successful agent run, keyed by a hash of (conversation_state, tool_name, tool_args). On replay, return cached results instead of making real calls. Use a record mode for capturing and a replay mode for CI/CD. Invalidate caches when tool behavior or model version changes.
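The cache key described above could be computed along these lines. This is a minimal sketch, not a reference implementation: the function name `cache_key` and the `model_version` and `tool_version` parameters are assumptions added for illustration. Folding the versions into the key is one possible way to get invalidation, because entries recorded under an old version are then never matched.

```python
import hashlib
import json


def cache_key(conversation_state, tool_name, tool_args,
              model_version="unknown", tool_version="unknown"):
    """Return a stable SHA-256 hex digest identifying one LLM or tool call.

    model_version and tool_version are hypothetical parameters: changing
    either one changes the key, so stale recordings are never replayed.
    """
    payload = {
        "conversation_state": conversation_state,
        "tool_name": tool_name,
        "tool_args": tool_args,
        "model_version": model_version,
        "tool_version": tool_version,
    }
    # sort_keys makes the serialization canonical, so the order in which
    # dict keys were inserted does not affect the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The inputs must be JSON-serializable for this to work. Anything volatile in the conversation state (timestamps, request IDs) would need to be stripped before hashing, otherwise replay never finds a match.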
Journey Context:
Testing agents is notoriously hard because LLM outputs are non-deterministic and tool calls are expensive or destructive (writing to databases, sending emails). Teams either skip agent testing entirely or pay for expensive test runs that still flake. Memoization addresses both problems: on the first run (record mode), capture every LLM response and tool result under a cache key derived from the conversation state and the call parameters. On subsequent runs (replay mode), return the cached results without making any real calls. This is the VCR approach to recording HTTP interactions, applied to agent loops.

The key insight for the cache key: hash (conversation_context_hash, tool_name, tool_args), not just tool_args, because the same tool call in different conversational contexts may need different responses.

Tradeoffs: cache invalidation is the hard part. If a tool's behavior changes or you upgrade the model, you must re-record. Replay also tests regression (the agent still follows the same path), not correctness (the right answer for new inputs). For CI/CD, though, it makes agent testing practical where it was previously too slow, costly, and flaky to run on every commit. LangChain's chat model caching can serve as a foundation for the LLM-response side of this pattern.
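The record and replay modes described above might look like the following. This is a minimal sketch under stated assumptions: `ToolCassette`, `ReplayMiss`, and the JSON file layout are hypothetical names chosen for illustration and do not come from LangChain or any other library. Results are assumed to be JSON-serializable.

```python
import hashlib
import json
import os


class ReplayMiss(LookupError):
    """Raised in replay mode when no recording exists for a call."""


class ToolCassette:
    """Hypothetical record/replay store for LLM and tool results."""

    def __init__(self, path, mode="replay"):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.path = path
        self.mode = mode
        self.entries = {}
        # Load earlier recordings if the cassette file already exists.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.entries = json.load(fh)

    @staticmethod
    def _key(context_hash, tool_name, tool_args):
        # Key on context, tool name, and args together, not args alone.
        canonical = json.dumps([context_hash, tool_name, tool_args],
                               sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def call(self, context_hash, tool_name, tool_args, real_fn):
        key = self._key(context_hash, tool_name, tool_args)
        if self.mode == "replay":
            # Never touch the real tool in replay mode; fail loudly so a
            # changed agent path shows up as a test failure.
            if key not in self.entries:
                raise ReplayMiss(f"no recording for {tool_name} {tool_args}")
            return self.entries[key]
        # Record mode: make the real call and store its result.
        result = real_fn(**tool_args)
        self.entries[key] = result
        return result

    def save(self):
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.entries, fh, indent=2, sort_keys=True)
```

In this sketch a test would run once with `mode="record"` against real tools, call `save()`, and commit the cassette file. CI would then construct the cassette with `mode="replay"`. Raising on a miss, instead of falling through to the real call, is a deliberate choice here so that destructive tools can never run in CI by accident.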
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:35:17.764773+00:00— report_created — created