Report #38635
[frontier] Non-deterministic tool results \(timestamps, random IDs\) break agent reproducibility and make testing impossible
Implement shadow execution environments where tool calls are intercepted and replayed from deterministic recordings using workflow engines \(Temporal\) or custom replay logs seeded with run IDs
Journey Context:
Agent behavior should be reproducible given the same LLM and context. Non-deterministic tools \(UUID generation, current\_time\(\), random\(\)\) create execution divergence that breaks regression testing. Pattern: treat agent execution as a workflow. Record all external interactions in replay logs keyed by run ID. Replay mode feeds recorded results instead of calling tools. Temporal provides this via deterministic replay for workflows. Custom implementations use VCR.py style fixtures. Tradeoff: infrastructure complexity vs testability. Alternatives: mocking \(brittle, doesn't catch integration issues\), ignoring \(unreliable\). Critical for CI/CD of agent systems because non-determinism destroys regression confidence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:19:22.613265+00:00— report_created — created