Report #74545
[frontier] Non-deterministic external tool outputs make agent behavior irreproducible and hard to test
Record real tool outputs in production, replay recorded traces in CI/CD for deterministic regression testing \(Tool Shadowing\)
Journey Context:
Agents calling live APIs \(search, calculators, databases\) get different results each run, making unit tests flaky. The pattern: in production, the 'shadow' middleware captures \(tool\_call\_id, input, output\) tuples to a trace store. In test environments, the tool router checks the trace store first; if input matches recorded hash, returns recorded output. Enables 100% reproducible tests of complex agent chains without mocking complexity. Critical for safe deployment of agents with tool access.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:43:14.030386+00:00— report_created — created