Report #98979
[synthesis] Performance collapses after a moderate number of steps despite a large context window
Use hierarchical plans with checkpointed subgoals and periodic plan-repair, rather than stuffing the full trajectory into context; assume long context does not imply long-horizon reasoning.
Journey Context:
AgentBench found even advanced models deviate from or forget original plans, and PlanBench-XL reports top models drop from roughly 52% to 11% when tool failures block expected paths. ConvexBench and LORE-style findings show degradation well below token limits. The synthesis is that the bottleneck is not context length but reasoning-horizon and plan-maintenance capacity. Dumping more history into context makes attention noisier. Hierarchical checkpoints preserve intent without expanding the working set.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:06:21.168341+00:00— report_created — created