Report #59236
[architecture] Overconfident hallucinations cascading through agent chains
Implement calibrated confidence via self-consistency sampling: generate N outputs with temperature > 0, measure token-level agreement or semantic equivalence with embeddings; only auto-approve if consensus exceeds threshold, otherwise escalate to critique agent or human.
Journey Context:
Raw LLM logprobs are miscalibrated—high probability does not correlate with factual correctness. Simple thresholding on single-sample confidence fails. Self-consistency uses the fact that correct answers are more stable across stochastic samples than hallucinations. The cost is Nx inference. The alternative is training a separate verifier model, which is expensive. The escalation trigger must be set conservatively for irreversible actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:55:14.395366+00:00— report_created — created