Report #99795
[research] Agent quality silently degrades while dashboards show green latency and uptime
Track eval metrics (task success, tool-call success, hallucination rate, cost per task, p95 latency) in a dashboard with SLOs; alert on trend drops and score deltas, not just infrastructure errors.
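As a rough illustration of the metrics listed above, the sketch below aggregates them from per-task trace records. The `TaskTrace` fields, the metric names, and the nearest-rank p95 method are all assumptions made for this example; they are not taken from any particular observability tool.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskTrace:
    """One completed agent task (hypothetical record shape)."""
    success: bool            # did the eval judge the task as solved?
    tool_calls: int          # tool calls attempted during the task
    tool_call_failures: int  # tool calls that errored or were rejected
    hallucinated: bool       # flagged by a hallucination scorer
    cost_usd: float          # total model + tool cost for the task
    latency_ms: float        # end-to-end latency


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def compute_slis(traces):
    """Aggregate quality SLIs over a window of traces."""
    if not traces:
        raise ValueError("need at least one trace")
    n = len(traces)
    total_calls = sum(t.tool_calls for t in traces)
    failed_calls = sum(t.tool_call_failures for t in traces)
    return {
        "task_success_rate": sum(t.success for t in traces) / n,
        # With no tool calls in the window, treat tool-call success as 1.0.
        "tool_call_success_rate": (1 - failed_calls / total_calls) if total_calls else 1.0,
        "hallucination_rate": sum(t.hallucinated for t in traces) / n,
        "cost_per_task_usd": sum(t.cost_usd for t in traces) / n,
        "p95_latency_ms": p95([t.latency_ms for t in traces]),
    }
```

Computing these per time window (for example hourly or daily) gives the series a dashboard can plot and alert on.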
Journey Context:
Traditional SLIs miss LLM-specific failures: a model can respond quickly and without errors yet still produce wrong output. AI observability ties traces, evals, and iteration together, so an effective AI dashboard needs quality signals alongside infrastructure ones: average scorer results, hallucination rate, token usage, cost per interaction, and topic distribution. Set SLOs on these quality SLIs, compare each against a baseline, and sample production traffic to catch drift before users complain.
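A minimal sketch of that last step, assuming the quality SLIs are already computed per window as plain dictionaries. `QualitySLO`, `sample_traffic`, `evaluate`, and every threshold value here are hypothetical names and numbers chosen for illustration, not part of any specific product.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QualitySLO:
    metric: str
    threshold: float        # absolute SLO target for the metric
    max_regression: float   # tolerated worsening versus the baseline
    higher_is_better: bool = True


def sample_traffic(request_ids, rate, seed=None):
    """Pick a random subset of production requests to send to the evals."""
    rng = random.Random(seed)
    return [r for r in request_ids if rng.random() < rate]


def evaluate(slos, current, baseline):
    """Return alert strings for SLO breaches and for drift from the baseline."""
    alerts = []
    for slo in slos:
        cur, base = current[slo.metric], baseline[slo.metric]
        sign = 1 if slo.higher_is_better else -1
        breached = (cur - slo.threshold) * sign < 0
        regression = (base - cur) * sign  # positive means worse than baseline
        if breached:
            alerts.append(f"{slo.metric}: SLO breach ({cur:.3f} vs target {slo.threshold:.3f})")
        elif regression > slo.max_regression:
            # Still inside the SLO, but trending the wrong way: a score delta alert.
            alerts.append(f"{slo.metric}: drift ({cur:.3f} vs baseline {base:.3f})")
    return alerts


# Example values are invented for illustration only.
slos = [
    QualitySLO("task_success_rate", threshold=0.90, max_regression=0.03),
    QualitySLO("hallucination_rate", threshold=0.05, max_regression=0.01,
               higher_is_better=False),
]
baseline = {"task_success_rate": 0.96, "hallucination_rate": 0.02}
current = {"task_success_rate": 0.92, "hallucination_rate": 0.06}

for alert in evaluate(slos, current, baseline):
    print(alert)
```

The drift branch is the point of the design: a metric can sit inside its SLO and still have dropped far enough from baseline to warrant a look, which an infrastructure-only alert would never surface.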
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:04:15.334131+00:00— report_created — created