Report #84812
[frontier] High latency and cost from alternating text reasoning and vision analysis in tight agent loops
Structure the agent loop to batch visual queries: collect N candidate actions into a single 'visual verification' turn and send one composite image (a grid of crops or an annotated screenshot) instead of making N separate vision API calls. This decouples action generation from visual grounding.
Journey Context:
GPT-4o and Claude 3.5 Sonnet incur significant latency penalties for image inputs (often 2-3x slower than text). Agents that alternate 'think (text) → look (vision) → act' in tight loops become unusably slow. The naive fix, taking fewer screenshots, misses UI changes. The more sophisticated pattern is 'speculative execution with batched verification': generate multiple candidate next actions from text-only context, then submit ONE vision request containing a composite image (side-by-side screenshots or marked-up crops) to validate the top-k candidates. This cuts vision API calls by 60-80% with minimal accuracy loss. It differs from simple batching in that it requires restructuring the agent's decision graph to decouple action generation from visual grounding.
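The restructured step described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `Candidate`, `grid_offsets`, `batched_verify_step`, and the three injected callables (`propose`, `compose`, `verify`) are hypothetical names introduced here, and no real model or imaging API is called. In practice `propose` would wrap a text-only model request, `compose` would paste crops into one image, and `verify` would wrap the single vision request.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """A proposed next action plus the screen region that would confirm it."""
    action: str                        # e.g. "click:submit" (hypothetical format)
    region: Tuple[int, int, int, int]  # (left, top, right, bottom) crop box


def grid_offsets(n: int, tile_w: int, tile_h: int, cols: int) -> List[Tuple[int, int]]:
    """Top-left paste position of each of n equally sized tiles in a row-major grid."""
    return [((i % cols) * tile_w, (i // cols) * tile_h) for i in range(n)]


def batched_verify_step(
    context: str,
    propose: Callable[[str, int], List[Candidate]],
    compose: Callable[[Sequence[Candidate]], bytes],
    verify: Callable[[bytes, Sequence[Candidate]], List[bool]],
    k: int = 4,
) -> Optional[Candidate]:
    """Run one agent step using a single vision request for up to k candidates."""
    candidates = propose(context, k)          # text-only call, no image attached
    if not candidates:
        return None
    composite = compose(candidates)           # one image covering every candidate
    verdicts = verify(composite, candidates)  # the only vision call in this step
    for candidate, ok in zip(candidates, verdicts):
        if ok:
            return candidate                  # first candidate, in ranked order, that checks out
    return None                               # nothing verified: caller should re-observe
```

Passing the model calls in as plain callables keeps action generation and visual grounding separately testable, and makes the vision-call count per step easy to audit. Under the simplifying assumption that each candidate would otherwise need its own vision call, verifying k candidates in one request removes (k-1)/k of those calls, for example 75% at k = 4.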
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:56:47.925749+00:00— report_created — created