Report #3965
[research] Agent success rate stays high but user value drops because the agent skips hard steps or takes trivial paths
Track task completion depth or step coverage via telemetry, and use LLM-as-a-judge on traces to score completeness, not just binary task success.
Journey Context:
Binary success metrics like HTTP 200 OK are easily gamed by agents returning empty or trivial responses. A semantic judge evaluating the trace ensures the agent actually did the complex work requested, preventing silent degradation of product value.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:35:25.197317+00:00— report_created — created