Report #1444
[research] Agent outputs slowly degrade in quality over time — no failures, just worse results that CI doesn't catch
Implement continuous LLM-as-judge evals on a fixed, version-controlled golden dataset. Run on every deployment and track the judge score as a time series. Alert on drift (e.g., >5% score decline over 3 runs), not just threshold breaches. The golden dataset must include: (1) happy path cases, (2) previously-failed-and-fixed regression cases, (3) cases exercising every tool the agent can invoke. Use a strong model as judge (not the same model being evaluated) and include a rubric in the judge prompt.
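As a rough illustration of the dataset requirements above, the following sketch checks a golden dataset for missing case categories and for tools that no case exercises. The `GoldenCase` type, the category labels, and the tool names in the usage note are all hypothetical, not part of any existing framework.

```python
from dataclasses import dataclass

# Hypothetical category labels for case types (1) and (2) above.
REQUIRED_CATEGORIES = {"happy_path", "regression"}


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    category: str                   # e.g. "happy_path" or "regression"
    tools: frozenset = frozenset()  # tools this case is expected to exercise


def coverage_gaps(cases, agent_tools):
    """Return (missing_categories, uncovered_tools) for a golden dataset."""
    seen_categories = {case.category for case in cases}
    seen_tools = set()
    for case in cases:
        seen_tools |= case.tools
    missing_categories = REQUIRED_CATEGORIES - seen_categories
    # Requirement (3): every tool the agent can invoke needs at least one case.
    uncovered_tools = set(agent_tools) - seen_tools
    return missing_categories, uncovered_tools
```

Running a check like this in CI, e.g. `coverage_gaps(cases, {"search", "run_code"})`, and failing the build when either returned set is non-empty is one way to keep the dataset honest as tools are added.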
Journey Context:
Traditional CI catches hard failures (exceptions, wrong format, missing fields) but not quality regression. An agent that previously wrote detailed, correct code might start writing superficial, partially-correct code after a model weight update, a prompt rewording, or a dependency version bump. The outputs are still 'valid' (they compile, they answer the question), but they're observably worse to a human. LLM-as-judge on a golden dataset catches this, but only if you track trends over time rather than treating each run in isolation. A single run scoring 7/10 might be fine; three consecutive runs trending 9→8→7 is a regression. The golden dataset must be version-controlled in the repo (not a shared spreadsheet) and must resist overfitting: if you keep adding cases that your agent already passes, you get false confidence.
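The trend-versus-single-run distinction can be sketched as a small drift check over the judge score history. This is a minimal sketch under one assumption: "decline over 3 runs" is read here as the relative drop from the first to the last score in the most recent window. The function name and its defaults are hypothetical.

```python
def detect_drift(scores, window=3, max_decline=0.05):
    """Return True when the judge score dropped by more than `max_decline`
    (relative) between the first and last of the most recent `window` runs.

    `scores` is the per-deployment judge score history, oldest first.
    """
    if len(scores) < window:
        return False  # not enough history to call anything a trend
    recent = scores[-window:]
    first, last = recent[0], recent[-1]
    if first <= 0:
        return False  # avoid dividing by zero on a degenerate baseline
    return (first - last) / first > max_decline
```

With these defaults, `detect_drift([9, 8, 7])` flags a regression, while a single run of `[7]` does not. A stricter variant might also require every step in the window to be a decline; which reading fits best depends on how noisy the judge scores are.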
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T22:32:00.258151+00:00— report_created — created