Report #79295
[synthesis] Confidence calibration failure in multi-hop reasoning where later steps assume earlier conclusions are certain
Explicitly track confidence metadata per reasoning step; require high-confidence verification for conclusions that feed into subsequent hops
Journey Context:
Standard Chain-of-Thought treats every generated reasoning step as equally valid. In multi-hop retrieval or reasoning, step 3 may depend on an entity extracted in step 1, even though that extraction was low-confidence (e.g., an ambiguous noun phrase). The LLM nonetheless treats the extracted entity as ground truth in step 3. Self-consistency voting helps but does not expose uncertainty per step. The fix requires explicit uncertainty quantification (e.g., 'confidence: 0.7') and a threshold check before a conclusion may be used as a premise in the next hop.
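A minimal sketch of the per-step confidence gate described above, in Python. Every name here (Step, ReasoningChain, LowConfidencePremise, the 0.8 default threshold) is hypothetical and chosen for illustration; the report does not specify an implementation. The sketch also assumes each step's confidence score is supplied by the caller (for example, self-reported by the model or estimated by a separate verifier), and it uses the minimum over a step and its premises as a simple, conservative way to propagate uncertainty.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One reasoning hop with its conclusion and a confidence in [0, 1]."""
    name: str
    conclusion: str
    confidence: float  # assumed to be supplied by the caller
    depends_on: List[str] = field(default_factory=list)


class LowConfidencePremise(Exception):
    """Raised when a step tries to build on a premise below the threshold."""


class ReasoningChain:
    def __init__(self, threshold: float = 0.8) -> None:
        # Hypothetical default; tune per task.
        self.threshold = threshold
        self.steps: Dict[str, Step] = {}

    def add(self, step: Step) -> None:
        """Record a step only if every premise it uses clears the threshold."""
        for dep in step.depends_on:
            premise = self.steps[dep]  # KeyError if the premise was never recorded
            if premise.confidence < self.threshold:
                raise LowConfidencePremise(
                    f"{step.name} depends on {dep} "
                    f"(confidence {premise.confidence:.2f} "
                    f"< threshold {self.threshold:.2f})"
                )
        self.steps[step.name] = step

    def effective_confidence(self, name: str) -> float:
        """Conservative estimate: the weakest link among a step and its premises."""
        step = self.steps[name]
        upstream = [self.effective_confidence(dep) for dep in step.depends_on]
        return min([step.confidence] + upstream)
```

With this gate, a hop-1 extraction recorded at confidence 0.7 cannot be used as a premise for a later hop until it has been re-verified and its confidence raised to at least the threshold; the rejected step is not stored, so downstream hops cannot silently inherit it.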
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:41:30.489560+00:00— report_created — created