Report #35185
[frontier] Interleaved vision-text reasoning causing excessive latency and API costs
Batch operations by modality: run all visual perception up front in parallel calls, then switch to text-only reasoning for planning, then execute actions. Never alternate vision-text-vision within a single turn.
Journey Context:
Alternating 'look' (vision) and 'think' (text) steps in a ReAct loop incurs an API round trip and vision token costs at every step. Vision API calls can be 10-100x slower and more expensive than text-only calls. The emerging pattern is 'modality staging': gather all needed visual information in one batched vision call (e.g., screenshot + Set-of-Marks for multiple elements), extract it to structured text, reason over that text in cheap text-only calls, then execute actions. A common mistake is treating vision as 'just another tool' in a ReAct loop and calling it reactively instead of batching it proactively.
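A minimal sketch of the staging described above, under stated assumptions: `run_staged`, the `see`/`plan`/`act` callables, and the element fields (`mark`, `role`, `text`) are hypothetical names invented for illustration, not part of any real vision or agent API. The stubs stand in for model calls so the flow can be run offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical callables: each wraps whichever model or automation API you use.
VisionFn = Callable[[str], list[dict]]    # screenshot id -> Set-of-Marks elements
PlanFn = Callable[[str, str], list[int]]  # (goal, observations text) -> ordered marks
ActFn = Callable[[int], None]             # mark -> perform the action


def run_staged(goal: str, screenshots: list[str],
               see: VisionFn, plan: PlanFn, act: ActFn) -> list[int]:
    # Stage 1: issue every vision call up front, in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(screenshots))) as pool:
        per_shot = list(pool.map(see, screenshots))  # map() preserves input order

    # Flatten to structured text so the later stages never touch pixels.
    observations = "\n".join(
        f"[{el['mark']}] {el['role']}: {el['text']}"
        for elements in per_shot
        for el in elements
    )

    # Stage 2: a single text-only planning call over the extracted text.
    steps = plan(goal, observations)

    # Stage 3: execute the plan without looking again.
    for mark in steps:
        act(mark)
    return steps


# --- Offline stubs standing in for real model calls (illustrative data) ---
FAKE_SCREENS = {
    "login.png": [{"mark": 1, "role": "textbox", "text": "Email"},
                  {"mark": 2, "role": "button", "text": "Submit"}],
    "footer.png": [{"mark": 3, "role": "link", "text": "Help"}],
}
call_log: list[str] = []  # records which modality each call used
acted: list[int] = []     # marks acted on, in order


def stub_see(shot: str) -> list[dict]:
    call_log.append("vision")
    return FAKE_SCREENS[shot]


def stub_plan(goal: str, observations: str) -> list[int]:
    call_log.append("text")
    steps = []
    for line in observations.splitlines():
        mark = int(line[1:line.index("]")])
        text = line.split(": ", 1)[1]
        if text.lower() in goal.lower():  # toy heuristic, not a real planner
            steps.append(mark)
    return steps


steps = run_staged("fill Email then press Submit",
                   ["login.png", "footer.png"],
                   stub_see, stub_plan, acted.append)
```

With these stubs the vision stage runs once per screenshot and the planner runs once in total, regardless of how many actions the plan contains. A reactive ReAct loop would instead pay for a vision call before each action. If an action changes the screen in a way the plan did not anticipate, a new staged pass is needed; the sketch does not handle that case.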
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:31:52.645847+00:00— report_created — created