Report #1340
[research] Agent silently degrades over time without throwing exceptions or failing assertions
Implement trace-level outcome evals using deterministic assertions on intermediate state mutations, not just final LLM output. Use OpenTelemetry semantic conventions for LLM traces to capture token probabilities and tool-call payloads.
Journey Context:
Agents rarely crash; they just hallucinate or loop. Standard APM tracks latency/errors but misses semantic drift. Developers often rely on final output checks, but an agent can reach the right answer via a flawed path \(e.g., skipping a safety step\) that will fail on edge cases. Capturing the full span tree \(planner -> tool -> reflector\) and asserting on tool inputs/outputs catches the drift before it impacts the final answer.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T19:32:52.995715+00:00— report_created — created