Report #52445
[research] Model generates a factually incorrect intermediate step but still arrives at the correct final answer, reinforcing bad logic
Evaluate the correctness of the intermediate reasoning steps, not just the final output. Use process reward models (PRMs) or step-by-step verification agents to score or validate each logical step independently.
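A minimal sketch of independent step validation, assuming the reasoning chain is a list of simple arithmetic claims of the form "a op b = c". The names `verify_steps`, `StepVerdict`, and the example chain are hypothetical and for illustration only; a real verification agent or PRM would replace the toy arithmetic check.

```python
import re
from dataclasses import dataclass

# Hypothetical toy verifier: it only understands claims like "8 * 2 = 16".
STEP_PATTERN = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


@dataclass
class StepVerdict:
    step: str
    valid: bool


def verify_steps(steps):
    """Check each step on its own, ignoring whether the final answer is right."""
    verdicts = []
    for step in steps:
        match = STEP_PATTERN.search(step)
        if match is None:
            # A step that cannot be parsed is treated as unverified.
            verdicts.append(StepVerdict(step, False))
            continue
        a, op, b, claimed = match.groups()
        actual = OPS[op](int(a), int(b))
        verdicts.append(StepVerdict(step, actual == int(claimed)))
    return verdicts


# Illustrative chain for (3 + 5) * 2: the final answer 16 is correct,
# but the second step is wrong and the third step patches it up.
chain = ["3 + 5 = 8", "8 * 2 = 15", "15 + 1 = 16"]
for verdict in verify_steps(chain):
    print(("OK  " if verdict.valid else "BAD ") + verdict.step)
```

This toy check only validates the arithmetic inside each step. It does not judge whether a step is justified by the problem, which is why the unmotivated third step still passes; a learned PRM or a verification agent would be needed for that.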
Journey Context:
Outcome-based reinforcement learning (RL) rewards a model for reaching the right answer even when its reasoning is flawed (right for the wrong reasons). When the model later faces a harder variant, the same flawed reasoning can produce a wrong answer. Outcome-based reward models (ORMs) score only the final answer, so they miss this; process reward models (PRMs), which score each step, are needed to catch subtle factual errors in the reasoning chain.
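The difference between the two reward signals can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the per-step scores are hand-assigned stand-ins for what a PRM might output, the function names are hypothetical, and taking the minimum is only one simple way to aggregate step scores (a product is another).

```python
def outcome_reward(final_answer, reference):
    """ORM-style signal: looks at the final answer only."""
    return 1.0 if final_answer == reference else 0.0


def process_reward(step_scores):
    """PRM-style signal: one low-scoring step caps the reward for the chain.

    Uses min aggregation, which is an assumption made for this sketch.
    """
    return min(step_scores) if step_scores else 0.0


# Hypothetical chain that reaches the right answer through a wrong step.
final_answer = "16"
reference = "16"
step_scores = [0.9, 0.1, 0.8]  # assumed PRM scores; the second step is flawed

print("outcome reward:", outcome_reward(final_answer, reference))
print("process reward:", process_reward(step_scores))
```

With these assumed scores the outcome signal gives full reward to the chain while the process signal does not, which is the gap described above.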
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:31:23.373023+00:00— report_created — created