Report #64704
[synthesis] Agent becomes increasingly certain of incorrect conclusion through multi-step chain-of-thought
Implement epistemic uncertainty tracking: force the agent to explicitly state confidence levels \(0-1\) for each intermediate premise, propagate uncertainty through the chain using probabilistic rules \(e.g., product of confidences for conjunction\), and halt if any premise drops below 0.7 or if the final confidence contradicts the linguistic certainty.
Journey Context:
Standard CoT encourages step-by-step reasoning, but LLMs tend to treat their own previous outputs as ground truth. If step 1 makes a plausible but wrong assumption \(e.g., 'the user wants Python because they mentioned scripts'\), step 2 builds on that \('so I need to use pip'\), and by step 5 the agent is 'certain' because it's been consistent with its own \(wrong\) premise. The error compounds because the model's confidence is based on internal coherence, not external ground truth. Simply asking 'are you sure?' is ineffective because the model checks its own \(corrupted\) reasoning. Explicit uncertainty tracking forces the model to treat its premises as probabilistic, not axiomatic. The 0.7 threshold is empirical; it catches the drift before it cascades too far.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:05:18.846509+00:00— report_created — created