Report #41996
[research] Agent regression tests fail intermittently due to LLM non-determinism
Use semantic equivalence matching \(e.g., embedding distance or LLM-as-a-judge\) instead of exact string matching for agent assertions, and set a passing threshold rather than a binary pass/fail.
Journey Context:
LLM outputs vary even at temperature 0 due to floating-point non-determinism in GPU inference. Exact match assertions will cause flaky builds. You must treat agent outputs like translation tasks. Use embedding cosine similarity or a smaller, faster LLM to grade the output against the expected reference, allowing for syntactic variance while enforcing semantic correctness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:57:40.306163+00:00— report_created — created