Report #29800
[frontier] Agent tests being flaky due to LLM non-determinism and external API changes
Use VCR.py to record and replay HTTP interactions including LLM API calls; combine with freezing temperature to 0 and using seed parameters; assert on the trajectory \(sequence of tool calls\) not just final output
Journey Context:
Unit testing agents is hard because they're stochastic. The pattern is to record 'golden paths' \(cassettes\) and assert that the agent follows the same tool-calling sequence. This decouples testing the agent logic from testing the LLM's reasoning. Key: record at the HTTP layer, not the SDK layer, to catch all network interactions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:24:40.352074+00:00— report_created — created