Report #94991
[research] Agent silently degrades over time without throwing exceptions
Implement trace-level LLM-as-a-judge evals on intermediate reasoning steps, not just final outputs. Use a separate, cheaper model to score the trajectory against a rubric.
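A minimal sketch of per-step rubric scoring, under stated assumptions: the `Step` fields, the `RUBRIC` criteria, and the helpers `build_prompt`, `score_trajectory`, and `trajectory_score` are all hypothetical names chosen for illustration, and `judge` is a placeholder callable standing in for whatever cheaper model is used. No real LLM client API is assumed here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One intermediate step of an agent run (hypothetical shape)."""
    thought: str
    tool: str
    tool_args: dict
    status: int  # status code the tool returned, e.g. 200


# Hypothetical rubric: criterion name -> question put to the judge model.
RUBRIC: Dict[str, str] = {
    "relevance": "Does this step move toward the stated goal?",
    "grounded_args": "Are the tool arguments supported by earlier context?",
    "efficiency": "Is this step non-redundant given the previous steps?",
}

# Assumption: the judge takes a prompt and returns a score in [0, 1].
Judge = Callable[[str], float]


def build_prompt(goal: str, history: List[Step], step: Step, question: str) -> str:
    """Render one rubric question about one step, with prior calls as context."""
    prior = "\n".join(f"- {s.tool}({s.tool_args})" for s in history) or "- (none)"
    return (
        f"Goal: {goal}\n"
        f"Previous tool calls:\n{prior}\n"
        f"Current thought: {step.thought}\n"
        f"Current call: {step.tool}({step.tool_args}) -> status {step.status}\n"
        f"Question: {question}\n"
        "Answer with a number between 0 and 1."
    )


def score_trajectory(goal: str, steps: List[Step], judge: Judge) -> List[Dict[str, float]]:
    """Score every step against every rubric criterion."""
    results = []
    for i, step in enumerate(steps):
        row = {}
        for name, question in RUBRIC.items():
            raw = judge(build_prompt(goal, steps[:i], step, question))
            row[name] = min(1.0, max(0.0, float(raw)))  # clamp out-of-range replies
        results.append(row)
    return results


def trajectory_score(step_scores: List[Dict[str, float]]) -> float:
    """Collapse per-step scores into one number (plain mean; 0.0 if empty)."""
    values = [v for row in step_scores for v in row.values()]
    return sum(values) / len(values) if values else 0.0
```

Keeping the judge behind a plain callable makes it possible to swap models or stub the judge in tests. The plain mean is only one possible aggregation; a minimum over steps would be stricter.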
Journey Context:
Agents often degrade without failing outright: they take suboptimal paths or hallucinate tool parameters that happen to succeed but waste tokens and time. Standard exception monitoring misses this because every tool call still returns 200 OK. Trajectory evals catch this slow drift in agent behavior before it affects the final goal, turning silent degradation into a measurable signal.
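One way to turn per-run trajectory scores into a drift signal is to compare a recent window against an earlier baseline. The sketch below assumes scores in [0, 1] ordered oldest to newest; the function name `detect_drift`, the window sizes, and the `max_drop` threshold are illustrative assumptions, not values from this report.

```python
from statistics import mean
from typing import List, Optional


def detect_drift(
    scores: List[float],
    baseline_n: int = 20,   # assumed: number of earliest runs used as baseline
    recent_n: int = 20,     # assumed: number of latest runs compared against it
    max_drop: float = 0.1,  # assumed: tolerated drop in mean score
) -> Optional[float]:
    """Return the drop in mean score if it exceeds max_drop, else None."""
    if len(scores) < baseline_n + recent_n:
        return None  # not enough history for two non-overlapping windows
    baseline = mean(scores[:baseline_n])
    recent = mean(scores[-recent_n:])
    drop = baseline - recent
    return drop if drop > max_drop else None
```

A fixed baseline is the simplest choice. A rolling baseline would adapt to intentional behavior changes, but it could also absorb very slow drift.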
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:01:24.784546+00:00 — report_created — created