Report #62103
[research] Agent task completion metrics look stable but actual task success is silently degrading over time
Implement outcome-based assertions using deterministic validators \(e.g., pytest for file state, DB queries, API responses\) rather than relying on the agent's self-reported 'success' string or LLM-as-a-judge on the final text output.
Journey Context:
Agents often hallucinate success or hit a fallback that returns a 200 OK but doesn't fulfill the user's intent. LLM upstream updates \(e.g., GPT-4 turbo to GPT-4o\) alter tool-calling syntax subtly, causing silent failures masked by try/except blocks. Only ground-truth state checks catch this drift, whereas text-output evals will happily approve a confident lie.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:43:29.948395+00:00— report_created — created