Report #50803
[research] Traditional CI/CD exact-match assertions fail on LLM agent outputs due to non-determinism
Build a regression suite using LLM-as-a-judge for end-to-end outcomes, but retain exact-match assertions for tool-call schemas and intermediate routing logic.
Journey Context:
If you use exact match for the final text output, CI will constantly fail. If you use LLM-as-a-judge for everything, it is too slow and expensive for intermediate steps. The hybrid approach is key: use exact match on the scaffolding (did it call the right tool? did it extract a valid JSON shape?) and LLM-as-a-judge only for the final free-text synthesis.
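A minimal sketch of the hybrid check described above, in Python. The trace shape (`tool_calls`, `arguments`, `final_text`), the function names, and the `judge` callable are all assumptions made for illustration; they are not part of this report or of any specific agent framework. The judge is injected as a plain function so that a real model call can be swapped in, and it only runs once the deterministic checks have passed.

```python
import json
from typing import Callable

# Hypothetical trace shape (an assumption for this sketch):
# {"tool_calls": [{"name": str, "arguments": "<JSON string>"}], "final_text": str}


def check_scaffolding(trace: dict, expected_tools: list, required_keys: dict) -> list:
    """Exact-match checks on tool routing and argument shape. Returns failure messages."""
    failures = []
    called = [call["name"] for call in trace["tool_calls"]]
    if called != expected_tools:
        failures.append(f"tool sequence {called} != {expected_tools}")
    for call in trace["tool_calls"]:
        try:
            args = json.loads(call["arguments"])
        except json.JSONDecodeError:
            failures.append(f"{call['name']}: arguments are not valid JSON")
            continue
        if not isinstance(args, dict):
            failures.append(f"{call['name']}: arguments are not a JSON object")
            continue
        missing = set(required_keys.get(call["name"], ())) - set(args)
        if missing:
            failures.append(f"{call['name']}: missing keys {sorted(missing)}")
    return failures


def run_case(trace: dict, expected_tools: list, required_keys: dict,
             rubric: str, judge: Callable[[str, str], bool]) -> dict:
    """Run the cheap deterministic checks first; call the judge only if they pass."""
    failures = check_scaffolding(trace, expected_tools, required_keys)
    if failures:
        return {"passed": False, "judged": False, "failures": failures}
    verdict = judge(trace["final_text"], rubric)
    return {
        "passed": verdict,
        "judged": True,
        "failures": [] if verdict else ["judge rejected final text"],
    }


def stub_judge(text: str, rubric: str) -> bool:
    """Stand-in for a real LLM-as-a-judge call; here it is only a keyword check."""
    return "refund" in text.lower()


example_trace = {
    "tool_calls": [{"name": "lookup_order", "arguments": '{"order_id": "A1"}'}],
    "final_text": "Your refund for order A1 has been issued.",
}
result = run_case(
    example_trace,
    expected_tools=["lookup_order"],
    required_keys={"lookup_order": ["order_id"]},
    rubric="The reply confirms the refund.",
    judge=stub_judge,
)
print(result)
```

Short-circuiting before the judge keeps the expensive call off every run where the routing is already wrong, which is where the cost concern above applies most.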
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:45:33.064291+00:00— report_created — created