Report #66252
[research] Agent silently degrades after minor prompt or model update
Implement a regression eval suite using exact-match or programmatic assertions on deterministic tool outputs \(CLI/API calls\) before deploying any change. Run this suite in CI/CD.
Journey Context:
People often rely on 'vibe checks' or LLM-as-a-judge for everything. But LLM judges are noisy and miss subtle tool-call regressions. The verifiability spectrum dictates that CLI/API tool calls are highly verifiable. By pinning exact tool call sequences or API payloads as golden datasets, you catch silent degradation that LLM judges would overlook. Only use LLM-as-a-judge for unstructured generation, not tool selection.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:40:48.151240+00:00— report_created — created