Report #26666
[research] Agent silently degrades without throwing exceptions
Implement trace-level LLM-as-a-judge evals on tool inputs/outputs and intermediate reasoning, not just final task success.
Journey Context:
Agents often generate tool calls that are syntactically valid but semantically empty or hallucinated (e.g., passing empty strings or reusing stale context). Standard exception monitoring misses these failures because nothing crashes. You need an observability layer that exports the exact LLM prompt/completion and tool I/O for each span to an eval pipeline, which then checks semantic correctness at the span level.
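A minimal sketch of what a span-level check could look like, assuming each tool call has already been exported from the trace as a record containing the prompt, completion, tool input, and tool output. The names `ToolSpan`, `Verdict`, `empty_argument_check`, `make_llm_judge`, and `evaluate_trace` are hypothetical and not taken from any specific observability or eval library. The LLM judge is injected as a plain `call_model` callable, so the actual model client, the rubric wording, and the PASS/FAIL reply format are all assumptions to adapt.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolSpan:
    """One tool call captured from an agent trace (hypothetical schema)."""
    trace_id: str
    tool_name: str
    prompt: str        # exact prompt sent to the LLM
    completion: str    # exact completion that produced the tool call
    tool_input: dict[str, Any]
    tool_output: Any


@dataclass
class Verdict:
    span: ToolSpan
    passed: bool
    reasons: list[str] = field(default_factory=list)


# A judge inspects one span and returns a failure reason, or None if it looks fine.
Judge = Callable[[ToolSpan], Optional[str]]


def empty_argument_check(span: ToolSpan) -> Optional[str]:
    """Cheap deterministic pre-check: flag empty or whitespace-only string arguments."""
    empty = [key for key, value in span.tool_input.items()
             if isinstance(value, str) and not value.strip()]
    if empty:
        return f"empty arguments: {', '.join(sorted(empty))}"
    return None


def make_llm_judge(call_model: Callable[[str], str]) -> Judge:
    """Wrap a user-supplied model call (assumed: prompt in, text out) as a judge."""
    def judge(span: ToolSpan) -> Optional[str]:
        rubric = (
            "You are reviewing one tool call made by an agent.\n"
            f"Prompt: {span.prompt}\n"
            f"Completion: {span.completion}\n"
            f"Tool: {span.tool_name}\n"
            f"Input: {json.dumps(span.tool_input, default=str)}\n"
            f"Output: {json.dumps(span.tool_output, default=str)}\n"
            "Reply PASS if the input is grounded in the prompt and the output "
            "is plausible for that input. Otherwise reply FAIL: <reason>."
        )
        reply = call_model(rubric).strip()
        if reply.upper().startswith("PASS"):
            return None
        return f"judge: {reply}"
    return judge


def evaluate_trace(spans: list[ToolSpan], judges: list[Judge]) -> list[Verdict]:
    """Run every judge on every span and collect per-span verdicts."""
    verdicts = []
    for span in spans:
        reasons = [reason for judge in judges
                   if (reason := judge(span)) is not None]
        verdicts.append(Verdict(span=span, passed=not reasons, reasons=reasons))
    return verdicts
```

In use, `evaluate_trace(spans, [empty_argument_check, make_llm_judge(call_model)])` returns one `Verdict` per tool call, so a trace can be flagged even when the final task appears to succeed. Running the deterministic check alongside the judge keeps obvious failures (empty strings) detectable without depending on the model's judgment.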
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:09:30.131952+00:00 — report_created — created