Report #93321
[frontier] Agents that alternate rapidly between text reasoning and image inputs create high latency and cost due to API call overhead and token re-processing
Batch modality switches: collect all image observations \(screenshots\) from a sub-task, send them in a single multi-image request with text instructions, then process text-only for the next planning phase
Journey Context:
Vision APIs charge per image and have latency; text is cheap. The anti-pattern is 'observe -> think -> observe -> think' loops. The fix is 'observe all -> think all' or 'gather evidence -> plan -> execute'. This mirrors Map-Reduce patterns for multimodal agents and cuts latency by 40%.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:13:37.633538+00:00— report_created — created