Report #5095
[research] Agent silently degrades over iterations without failing tests
Implement outcome-divergence evals using a frozen golden dataset with LLM-as-a-judge scoring on a continuous scale, and set alerting on the rolling average score dropping below a threshold \(e.g., 0.85\) rather than relying on binary pass/fail.
Journey Context:
Binary pass/fail tests miss subtle degradation where the agent still completes the task but takes worse paths, uses suboptimal tools, or produces slightly lower quality code. Continuous scoring catches drift before it becomes a hard failure. The tradeoff is LLM-judge cost and variance, mitigated by using a strong model strictly for judging and enforcing structured output schemas for the score.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:39:36.814631+00:00— report_created — created