Report #58476
[research] Agent regression tests are flaky because LLM outputs vary from run to run, causing spurious failures in CI/CD pipelines even when the agent behaved correctly.
Use state-machine or graph-transition assertions for regression suites. Instead of asserting exact text outputs, assert that the agent's trace spans match an allowed sequence of state transitions (e.g., START -> PLAN -> TOOL_CALL -> VALIDATE -> END).
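A minimal sketch of a graph-transition assertion, assuming the trace has already been reduced to an ordered list of state names. `ALLOWED_TRANSITIONS` and `find_illegal_transitions` are hypothetical names, and the repeated TOOL_CALL and the VALIDATE -> PLAN retry edge are assumptions added for illustration; only the START -> PLAN -> TOOL_CALL -> VALIDATE -> END path comes from the report.

```python
# Hypothetical transition graph. The happy path mirrors the example sequence;
# the TOOL_CALL self-loop and the VALIDATE -> PLAN retry edge are assumptions.
ALLOWED_TRANSITIONS = {
    "START": {"PLAN"},
    "PLAN": {"TOOL_CALL"},
    "TOOL_CALL": {"TOOL_CALL", "VALIDATE"},
    "VALIDATE": {"PLAN", "END"},
    "END": set(),
}


def find_illegal_transitions(states):
    """Return every (src, dst) pair that the graph does not allow.

    A missing START at the beginning is reported as (None, first_state),
    and a missing END at the end as (last_state, None).
    """
    illegal = []
    if not states or states[0] != "START":
        illegal.append((None, states[0] if states else None))
    for src, dst in zip(states, states[1:]):
        if dst not in ALLOWED_TRANSITIONS.get(src, set()):
            illegal.append((src, dst))
    if states and states[-1] != "END":
        illegal.append((states[-1], None))
    return illegal
```

In a regression test, asserting `find_illegal_transitions(states) == []` gives a failure message that names the offending edges, which tends to be easier to debug than a plain boolean.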
Journey Context:
Traditional unit tests assert on exact strings or JSON structures, and those assertions break whenever an LLM rephrases an otherwise correct answer. An agent's control flow, by contrast, can be modeled as a state machine. By extracting the agent's trace and reducing it to a sequence of span names and statuses, you can use regex or graph matching to validate that the agent stayed within the boundaries of acceptable workflows. Because the check no longer depends on exact wording, output variation alone does not fail the build, which reduces CI flakiness.
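The reduce-then-match step could look roughly like the following. This is a sketch under stated assumptions: spans are plain dicts with `name`, `status`, and `start` keys (an illustrative shape, not any tracing library's export format), and `ALLOWED_WORKFLOW`, `reduce_trace`, and `trace_is_allowed` are hypothetical names.

```python
import re

# Hypothetical allowed workflow: one or more PLAN -> TOOL_CALL+ -> VALIDATE
# rounds between START and END, with every span finishing with status "ok".
ALLOWED_WORKFLOW = re.compile(
    r"START:ok (?:PLAN:ok (?:TOOL_CALL:ok )+VALIDATE:ok )+END:ok"
)


def reduce_trace(spans):
    """Collapse spans into a 'NAME:status' token string, ordered by start time."""
    ordered = sorted(spans, key=lambda s: s["start"])
    return " ".join(f"{s['name']}:{s['status']}" for s in ordered)


def trace_is_allowed(spans):
    """True if the whole reduced trace matches the allowed workflow pattern."""
    return ALLOWED_WORKFLOW.fullmatch(reduce_trace(spans)) is not None
```

The model's text output is never inspected here, so two runs that phrase their answers differently but follow the same workflow produce the same result.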
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:38:21.983831+00:00— report_created — created