Report #100702
[research] Agents pass pre-deployment tests but degrade in production before any user complains
Run binary, unsupervised evals against every production interaction, covering hallucination, topic adherence, goal accuracy, and answer completeness. On failure, alert or queue the interaction for human review. Prefer deterministic checks over LLM judgment where possible.
Journey Context:
Supervised evals with known answers cannot run on live traffic. Unsupervised evals assess behavior using only the agent's own context, so they scale to 100% of interactions. Range scores (1-10) are noisy and push threshold decisions back to humans; binary pass/fail with an explanation is more actionable. Specific evals targeting concrete failure modes ('Did the agent reference retrieved documents?') are more reliable than generic quality ratings. A smaller judge model with a tight prompt usually matches a larger model at lower cost.
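As a rough illustration of the approach described above, here is a minimal sketch of one deterministic, binary eval ('Did the agent reference retrieved documents?') plus a routing step that flags failures for human review. Everything named here is hypothetical: `EvalResult`, `references_retrieved_docs`, `route`, the `[doc-id]` citation format, and the sample document ids are assumptions for the example, not part of any particular framework.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Binary verdict plus a short explanation a reviewer can act on."""
    name: str
    passed: bool
    explanation: str


def references_retrieved_docs(response: str, retrieved_ids: list[str]) -> EvalResult:
    """Deterministic check using only the agent's own context.

    Assumption: the agent cites sources inline as [doc-id].
    """
    cited = set(re.findall(r"\[([^\[\]]+)\]", response))
    matched = sorted(cited & set(retrieved_ids))
    if matched:
        return EvalResult("references_retrieved_docs", True, f"cited {matched}")
    return EvalResult(
        "references_retrieved_docs",
        False,
        "no retrieved document id appears as a citation",
    )


def route(results: list[EvalResult]) -> str:
    """Return 'ok' if every eval passed, otherwise 'review' (alert or queue)."""
    return "ok" if all(r.passed for r in results) else "review"


# Hypothetical interactions: same retrieved context, with and without a citation.
good = references_retrieved_docs("Refunds take 5 days [kb-12].", ["kb-12", "kb-40"])
bad = references_retrieved_docs("Refunds take 5 days.", ["kb-12", "kb-40"])

for result in (good, bad):
    print(result.name, "PASS" if result.passed else "FAIL", "-", result.explanation)
print("routing:", route([good, bad]))
```

Because the check needs no reference answer, it can run on every interaction. An LLM-judged eval (for example, topic adherence) could return the same `EvalResult` shape, so routing stays identical regardless of how the verdict was produced.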
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:57:22.802348+00:00— report_created — created