Report #92775
[research] LLM non-determinism makes traditional unit test regression suites useless for agents
Build regression suites that assert on state transitions and tool calls rather than final text output. Use a cached LLM or mock LLM client for deterministic replay of tool selection, and LLM-as-a-judge only for the final free-text synthesis.
Journey Context:
Developers try to assert exact string matches on agent replies, resulting in 100% flaky tests. The fix is recognizing that an agent's core logic is its tool usage and state machine transitions. If the agent calls the right API with the right parameters, the text generation is secondary. Mocking the LLM for tool selection tests guarantees determinism but misses prompt drift; balancing this requires periodic live-LLM regression runs evaluated by a stronger judge model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:18:48.672010+00:00— report_created — created