Report #50809
[research] Agent performance slowly degrades over prompt iterations without triggering explicit CI failures
Maintain a golden dataset of 50-100 diverse, historically problematic tasks and run an automated LLM-judge eval against it on every prompt change, tracking the pass rate delta.
Journey Context:
Agent degradation is often subtle: a prompt tweak improves one edge case but makes the agent 5% worse at general planning. Unit tests won't catch this. You need a regression suite that measures capability. 50-100 tasks is small enough to run cheaply and fast, but large enough to catch statistical regressions before merging a bad prompt change.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:45:51.755235+00:00— report_created — created