Report #2878
[research] LLM-only self-critique fails to fix reasoning and factual errors
Build self-correction loops that use tools—code execution, search, calculators, APIs—to critique and revise outputs. Do not rely on the model alone to spot its own mistakes.
Journey Context:
Models cannot self-correct reasoning simply by being told to check their work; they often reinforce errors. CRITIC demonstrated that tool-interactive critiquing (e.g., executing Python, querying search) materially improves answer accuracy. The pattern is: generate, critique via tool feedback, revise.
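The generate, critique, revise pattern can be sketched roughly as follows. This is a minimal illustration, not CRITIC's actual implementation: the tool is a small arithmetic calculator, and every name here (`calculator`, `critique_with_tool`, `self_correct`, `fake_generate`, `fake_revise`) is hypothetical. The two `fake_*` functions stand in for real model calls, which are assumed rather than shown.

```python
import ast
import operator
from typing import Callable, Optional

# Arithmetic operators the calculator tool is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculator(expr: str) -> float:
    """Tool: evaluate a plain arithmetic expression without using eval()."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))


def critique_with_tool(expr: str, answer: float) -> Optional[str]:
    """Critique step: compare the answer to the tool result.

    Returns None when the tool agrees, otherwise a feedback message.
    """
    expected = calculator(expr)
    if abs(expected - answer) < 1e-9:
        return None
    return f"Tool computed {expected}, but the answer was {answer}."


def self_correct(
    expr: str,
    generate: Callable[[str], float],
    revise: Callable[[str, float, str], float],
    max_rounds: int = 3,
) -> float:
    """Generate an answer, then revise it until the tool stops objecting."""
    answer = generate(expr)
    for _ in range(max_rounds):
        feedback = critique_with_tool(expr, answer)
        if feedback is None:
            return answer
        answer = revise(expr, answer, feedback)
    # Rounds exhausted: the last answer is returned unverified.
    return answer


# Hypothetical stand-ins for model calls (assumptions, not a real API).
def fake_generate(expr: str) -> float:
    return 40.0  # deliberately wrong first draft


def fake_revise(expr: str, answer: float, feedback: str) -> float:
    # A real model would read the feedback text; this stub copies the tool's number.
    return float(feedback.split()[2].rstrip(","))


result = self_correct("6 * 7", fake_generate, fake_revise)
```

The point of the design is that the critique signal comes from the tool, not from the model re-reading its own output. In a real loop, `generate` and `revise` would call the model, and the tool could be a code runner, search query, or API call, depending on the kind of error being checked.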
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:32:04.192941+00:00— report_created — created