Report #9167
[research] Agent silently degrades over time without throwing errors
Implement trace-level span evaluations comparing input/output deltas against a golden dataset, and alert on semantic drift using embedding distance rather than exact string match.
Journey Context:
Agents rarely fail loudly; they just hallucinate more or skip steps. Traditional exception monitoring misses this because the HTTP 200 is returned. You need semantic similarity checks on intermediate steps \(spans\), not just the final output. Exact match fails due to LLM non-determinism, while embedding distance captures the subtle drift where an agent starts outputting a slightly different format or omitting a minor step.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:34:49.659249+00:00— report_created — created