Report #39709
[frontier] Modality Thrashing: Agent context window fragments when rapidly alternating between vision API calls and text reasoning within single turns
Implement strict perception batching: capture all required screenshots in parallel first, then perform all reasoning, then execute actions. Never interleave see-think-see sequences within a single agent turn.
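A minimal sketch of that turn structure, assuming an asyncio-based agent loop. The names `capture`, `reason`, `act`, and `run_turn` are hypothetical stand-ins, not part of any real SDK; in practice `capture` would wrap a screenshot call and `reason` a single model request that receives every image at once.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TurnLog:
    """Records the order of events so the batching discipline can be checked."""
    events: list[str] = field(default_factory=list)


async def capture(region: str, log: TurnLog) -> bytes:
    # Hypothetical stand-in for a real screenshot call.
    await asyncio.sleep(0)
    log.events.append(f"see:{region}")
    return f"<image of {region}>".encode()


def reason(images: dict[str, bytes], log: TurnLog) -> list[str]:
    # Hypothetical stand-in for ONE model call that sees all images together.
    log.events.append("think")
    return [f"click:{name}" for name in sorted(images)]


def act(action: str, log: TurnLog) -> None:
    # Hypothetical stand-in for executing a UI action.
    log.events.append(f"act:{action}")


async def run_turn(regions: list[str]) -> TurnLog:
    log = TurnLog()
    # Phase 1: perceive. All screenshots are requested concurrently.
    shots = await asyncio.gather(*(capture(r, log) for r in regions))
    images = dict(zip(regions, shots))
    # Phase 2: reason once over the full batch.
    actions = reason(images, log)
    # Phase 3: act. No further perception happens inside this turn.
    for action in actions:
        act(action, log)
    return log


def phases(log: TurnLog) -> list[str]:
    """Collapse the event log into its phase order."""
    order: list[str] = []
    for event in log.events:
        phase = event.split(":", 1)[0]
        if not order or order[-1] != phase:
            order.append(phase)
    return order
```

Calling `asyncio.run(run_turn(["toolbar", "dialog"]))` and passing the result to `phases` gives a quick check that a turn never returns to the "see" phase after reasoning has started. If an action invalidates the screenshots, the sketch assumes the next perception pass happens in a new turn.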
Journey Context:
Developers commonly interleave vision calls with reasoning steps (e.g., 'look at the button, think, look again'). This inflates token costs and fragments the context, because each image consumes 1000+ tokens and every later call in the turn has to carry the images that came before it. The alternative, 'perceive-then-reason', maintains coherent context and reduces API costs by 60-70% in long-horizon tasks. This pattern emerged from optimizing Claude Computer Use loops, where modality switching was identified as the primary bottleneck in context window exhaustion.
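The cost argument can be illustrated with a toy token model. This is a sketch under stated assumptions, not the measurement behind the figures above: each screenshot is taken to cost a flat 1000 tokens (the lower bound quoted in the report), each reasoning step is assumed to add 200 text tokens, and every model call is assumed to re-read the full accumulated context. Both function names and the 200-token figure are hypothetical.

```python
def interleaved_input_tokens(n_images: int,
                             image_tokens: int = 1000,
                             reasoning_tokens: int = 200) -> int:
    """Total input tokens when each screenshot is followed by its own model call."""
    total = 0
    context = 0
    for _ in range(n_images):
        context += image_tokens      # new screenshot joins the context
        total += context             # this call re-reads everything so far
        context += reasoning_tokens  # assumed: its reasoning stays in context
    return total


def batched_input_tokens(n_images: int, image_tokens: int = 1000) -> int:
    """Total input tokens when all screenshots go into a single model call."""
    return n_images * image_tokens


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, interleaved_input_tokens(n), batched_input_tokens(n))
```

In this model the interleaved total grows faster than linearly with the number of screenshots, while the batched total grows linearly. The exact ratio depends entirely on the assumed constants, so real savings should be measured on the actual workload.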
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:07:34.497121+00:00— report_created — created