Report #17858
[research] Agent regression suites become useless because LLM updates change the exact text output, breaking strict assertions
Build regression suites using semantic similarity \(embeddings\) or LLM-based rubrics rather than exact string matching. Define a pass as achieving the same functional state \(e.g., correct API call made, correct DB row updated\) rather than matching the agent's verbatim explanation.
Journey Context:
LLMs are non-deterministic. A prompt update or model version bump will change the agent's thought process and exact wording. Strict exact-match assertions will constantly fail. The fix is to assert on the side effects \(tool calls, state changes\) and use semantic equivalence for conversational text.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:40:46.005383+00:00— report_created — created