Report #68696
[counterintuitive] Why does the model handle a short reasoning chain but fail on the same logic with more steps?
Test prompts at target sequence lengths during development; do not assume capability demonstrated on short examples transfers to longer chains. Decompose long chains into verified intermediate checkpoints.
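A minimal sketch of what testing at target lengths could look like, assuming the model is reachable as a plain `prompt -> answer` callable. The names `accuracy_by_length`, `make_sum_case`, and `toy_model` are hypothetical and exist only for illustration; `toy_model` is a stand-in that deliberately ignores everything past the fourth term, so the harness has a length failure to surface.

```python
from typing import Callable, Dict, Iterable, Tuple

# (prompt, expected answer, number of reasoning steps in the prompt)
Case = Tuple[str, str, int]


def accuracy_by_length(
    model: Callable[[str], str], cases: Iterable[Case]
) -> Dict[int, float]:
    """Group test cases by step count and return the accuracy of each group."""
    totals: Dict[int, int] = {}
    correct: Dict[int, int] = {}
    for prompt, expected, steps in cases:
        totals[steps] = totals.get(steps, 0) + 1
        if model(prompt).strip() == expected:
            correct[steps] = correct.get(steps, 0) + 1
    return {steps: correct.get(steps, 0) / totals[steps] for steps in sorted(totals)}


def make_sum_case(n_terms: int) -> Case:
    """Build a chained-addition case with n_terms terms (hypothetical task)."""
    terms = list(range(1, n_terms + 1))
    return (" + ".join(map(str, terms)), str(sum(terms)), n_terms)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: only 'reads' the first four terms."""
    terms = [int(t) for t in prompt.split(" + ")]
    return str(sum(terms[:4]))


if __name__ == "__main__":
    cases = [make_sum_case(n) for n in (2, 4, 8)]
    # Short chains pass, the 8-term chain fails.
    print(accuracy_by_length(toy_model, cases))
```

Reporting accuracy per length bucket, instead of one pooled number, keeps a failure at the deployment length from being averaged away by passing short cases.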
Journey Context:
Developers often test prompts on short examples and deploy them on longer ones, assuming the capability scales. Transformers exhibit systematic length generalization failures: performance can degrade sharply on sequences longer than those seen in training. Attention patterns and positional encodings are learned within a length distribution, and extrapolation beyond it is unreliable. More reasoning steps mean more positions the model has not reliably handled, and an error at any step carries into every step after it. Prompt engineering alone does not reliably fix this; it calls for architectural changes or external scaffolding that verifies intermediate steps.
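The external scaffolding mentioned above can be sketched as a loop that accepts each step only after an independent check passes. This is one possible shape under stated assumptions, not a definitive implementation: `run_with_checkpoints`, `CheckpointError`, and `make_flaky_adder` are hypothetical names, `step_fn` stands in for a model call, and `verify_fn` stands in for whatever deterministic check the task allows (a calculator, a unit test, a schema validator).

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # state carried between steps
I = TypeVar("I")  # one step's instruction


class CheckpointError(RuntimeError):
    """Raised when a step never passes verification."""


def run_with_checkpoints(
    initial: S,
    instructions: Sequence[I],
    step_fn: Callable[[S, I], S],
    verify_fn: Callable[[S, I, S], bool],
    max_retries: int = 2,
) -> S:
    """Run one short step at a time; commit a result only once it is verified."""
    state = initial
    for index, instruction in enumerate(instructions):
        for _attempt in range(max_retries + 1):
            candidate = step_fn(state, instruction)
            if verify_fn(state, instruction, candidate):
                state = candidate  # checkpoint: verified, safe to build on
                break
        else:
            raise CheckpointError(
                f"step {index} failed verification after {max_retries + 1} attempts"
            )
    return state


def make_flaky_adder() -> Callable[[int, int], int]:
    """Stand-in for a model: every third call returns an off-by-one result."""
    calls = {"n": 0}

    def step(state: int, addend: int) -> int:
        calls["n"] += 1
        return state + addend + (1 if calls["n"] % 3 == 0 else 0)

    return step


if __name__ == "__main__":
    total = run_with_checkpoints(
        0,
        [1, 2, 3, 4, 5],
        make_flaky_adder(),
        lambda prev, addend, new: new == prev + addend,  # independent check
    )
    print(total)  # 15: both bad attempts were caught and retried
```

The design choice here is that each model call covers a single short step, so the model stays inside lengths it handles, while the verifier stops a wrong intermediate result from reaching later steps. It only helps when the check is independent of the model being checked.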
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:47:17.942736+00:00— report_created — created