Report #46499
[frontier] Interleaving text reasoning and vision requests causes token count explosion and latency degradation
Enforce strict observation-action cycles: batch all visual inspection into a single high-resolution screenshot taken only when the UI is stable, then complete all reasoning before requesting the next observation.
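One way to approximate "only when the UI is stable" is to poll until two consecutive captures are identical. This is a minimal sketch, not part of any vendor API: `wait_for_stable_frame`, its parameters, and the `capture` callable are hypothetical names, and byte-for-byte equality is an assumed stability test that will never settle on UIs with blinking cursors or animations.

```python
import time
from typing import Callable


def wait_for_stable_frame(
    capture: Callable[[], bytes],
    interval_s: float = 0.2,
    max_attempts: int = 10,
) -> bytes:
    """Poll `capture` until two consecutive frames match, then return that frame.

    `capture` is a hypothetical callable returning raw screenshot bytes.
    If the UI never settles within `max_attempts` polls, the most recent
    frame is returned so the agent loop is not blocked forever.
    """
    previous = capture()
    for _ in range(max_attempts):
        time.sleep(interval_s)
        current = capture()
        if current == previous:
            # Two identical frames in a row: treat the UI as stable.
            return current
        previous = current
    return previous
```

The returned frame is the only image sent to the model for that cycle; intermediate frames are discarded locally and cost no tokens.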
Journey Context:
Developers naturally architect agents that "think" (text), "look" (vision), "think", "look". Each vision request injects 4k-8k tokens for a high-resolution screenshot. Anthropic's computer use tool is built around a turn-based loop that avoids this pattern: the agent outputs an action, the application executes it and captures a screenshot, and the agent receives that screenshot as its next observation. This batches vision into discrete windows and avoids the "modal switch tax" of interleaved text and vision tokens.
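The growth can be estimated with simple arithmetic. The sketch below assumes the full conversation history is resent as input on every turn, with no prompt caching and no pruning of old images; `cumulative_input_tokens` is a hypothetical helper. The 6,000-token screenshot is the midpoint of the 4k-8k range above, and the 200 tokens of reasoning per turn is an illustrative guess.

```python
from typing import List


def cumulative_input_tokens(tokens_added_per_turn: List[int]) -> int:
    """Total input tokens across all turns when every turn resends the
    whole history (assumption: no caching, no image pruning)."""
    total = 0
    context = 0
    for added in tokens_added_per_turn:
        context += added   # history grows by this turn's new content
        total += context   # the whole history is sent again as input
    return total


SCREENSHOT = 6_000  # assumed: midpoint of the reported 4k-8k range
REASONING = 200     # assumed: text tokens produced per reasoning step

# Interleaved: every one of four turns adds a screenshot plus reasoning.
interleaved = cumulative_input_tokens([SCREENSHOT + REASONING] * 4)

# Batched: one screenshot up front, then three text-only reasoning turns.
batched = cumulative_input_tokens(
    [SCREENSHOT + REASONING, REASONING, REASONING, REASONING]
)

print(interleaved, batched)  # prints "62000 26000"
```

Under these assumptions the interleaved pattern sends roughly 2.4 times as many input tokens over four turns, and the gap widens with every additional look because each screenshot is re-sent on all later turns.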
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:31:14.485296+00:00— report_created — created