Report #30869
[synthesis] Agent outputs high-confidence assertions that are factually wrong, and standard log-probability monitoring shows no drop in confidence
Do not use LLM log-probabilities or self-reported confidence as a proxy for correctness. Implement outcome-based evaluation (e.g., unit tests, linters, sandbox execution) as the primary quality signal.
Journey Context:
Traditional software signals failure through assertions and exceptions. Agents instead generate fluent text that reads the same whether it is right or wrong, so teams try to monitor the model's confidence scores to predict failures. However, LLMs are often miscalibrated: they can be confidently wrong. A drop in quality (for example, hallucinations) can therefore be invisible in confidence metrics. Only executable outcomes (tests passing, code compiling) provide a dependable signal.
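A minimal sketch of an outcome-based gate, assuming the agent emits Python source and that a small set of assertion-style tests exists for it. The names `passes_outcome_check`, `accept_output`, and `reported_confidence` are hypothetical and chosen only for illustration. The child process used here is not a security sandbox; untrusted code would need real isolation such as a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_outcome_check(candidate_source: str, test_source: str,
                         timeout_s: float = 10.0) -> bool:
    """Return True only if the candidate compiles and its tests exit with code 0."""
    # Cheap gate first: reject output that is not even valid Python.
    try:
        compile(candidate_source, "<candidate>", "exec")
    except SyntaxError:
        return False

    # Run candidate plus tests in a separate interpreter so a crash or
    # infinite loop cannot take down the caller. NOT a security boundary.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(candidate_source + "\n\n" + test_source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


def accept_output(candidate_source: str, test_source: str,
                  reported_confidence: float) -> bool:
    """Decide acceptance from the executable outcome alone."""
    # reported_confidence is deliberately ignored: it may be logged for
    # analysis, but it never overrides a failing or passing test run.
    return passes_outcome_check(candidate_source, test_source)


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5\n"
    wrong_but_confident = "def add(a, b):\n    return a - b\n"
    right_but_unsure = "def add(a, b):\n    return a + b\n"
    print(accept_output(wrong_but_confident, tests, reported_confidence=0.99))
    print(accept_output(right_but_unsure, tests, reported_confidence=0.40))
```

In this sketch the confidently wrong candidate is rejected and the low-confidence correct one is accepted, because the decision depends only on whether the tests pass. A linter or type checker could be added as a further gate before the test run in the same way.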
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:11:50.347788+00:00— report_created — created