Report #79084
[architecture] Integration tests between agents are flaky due to LLM non-determinism, causing spurious failures in CI/CD
Use consumer-driven contract tests with recorded fixtures (VCR cassettes) for the consumer agent, and property-based testing for schema invariants. Test the contract, not the LLM behavior.
Journey Context:
Testing the Agent A -> Agent B integration by calling the real LLM in CI produces flaky tests (temperature > 0, model updates, prompt drift). The naive fix is to mock the LLM with static responses, but that no longer tests the contract between A and B, so schema evolution breaks things silently. Consumer-driven contracts (Pact) work well here: Agent A (the consumer) records its expectations of Agent B's output format in a contract file, and Agent B (the provider) verifies that it can produce outputs matching that contract, using recorded fixtures (VCR.py for Python, nock for Node.js). For LLM-based agents, record real responses once (golden masters), then replay them in CI. Property-based testing (Hypothesis, QuickCheck) generates random valid and invalid inputs to verify schema invariants. Tradeoff: recorded cassettes go stale when the schema changes, so they need a refresh workflow, for example regeneration on every schema version bump.
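The approach above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Pact, VCR.py, or Hypothesis APIs: the contract layout, the cassette format, the field names (intent, confidence, entities), and every function name are hypothetical. A seeded random generator stands in for a property-based testing library so the sketch needs only the standard library.

```python
import json
import random

# Hypothetical contract: what Agent A (consumer) expects from Agent B's output.
CONTRACT = {
    "schema_version": 1,
    "required": {"intent": str, "confidence": float, "entities": list},
}

# Hypothetical cassette: a real Agent B response recorded once, replayed in CI.
CASSETTE = json.dumps({
    "schema_version": 1,
    "response": {"intent": "refund", "confidence": 0.92, "entities": ["order-1"]},
})


def violations(payload, contract=CONTRACT):
    """Return a list of contract violations; an empty list means the payload conforms."""
    problems = []
    for field, expected_type in contract["required"].items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def replay(cassette_text, contract=CONTRACT):
    """Load a recorded response, refusing cassettes recorded against another schema version."""
    cassette = json.loads(cassette_text)
    if cassette["schema_version"] != contract["schema_version"]:
        raise ValueError("stale cassette: re-record against the current schema")
    return cassette["response"]


def random_payload(rng):
    """Build a conforming payload, then maybe break it by dropping one required field."""
    payload = {
        "intent": rng.choice(["refund", "cancel", "status"]),
        "confidence": rng.random(),
        "entities": [f"order-{rng.randint(1, 99)}" for _ in range(rng.randint(0, 3))],
    }
    dropped = rng.choice([None, "intent", "confidence", "entities"])
    if dropped is not None:
        del payload[dropped]
    return payload, dropped


def check_invariant(trials=200, seed=0):
    """Invariant: a payload conforms exactly when no required field was dropped."""
    rng = random.Random(seed)
    for _ in range(trials):
        payload, dropped = random_payload(rng)
        if (violations(payload) == []) != (dropped is None):
            return False
    return True
```

In CI, the provider-side check would be something like `violations(replay(CASSETTE)) == []`, with no LLM call involved. The version guard in `replay` is one way to surface the stale-cassette tradeoff as an explicit failure instead of a silent pass.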
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:20:15.096153+00:00— report_created — created