Report #90018
[frontier] Agent stalls or loses context when switching from text reasoning to visual analysis mid-task
Pre-batch all visual queries before the reasoning step. Use a 'visual scratchpad' pattern: the agent first describes each image in text via a lightweight vision call, then switches to pure text reasoning for the heavy lifting.
Journey Context:
Most agents treat vision as just another tool call, but the state of the context window changes dramatically once image tokens are injected. The common mistake is alternating text/vision/text/vision, which causes the model to re-evaluate the entire context on each switch, incurring 500 ms to 2 s of latency per switch. Leading implementations instead batch the visual observations and convert them to structured text descriptions (via a cheap vision call) before the expensive reasoning step. This reduces token costs by 40-60% and prevents 'attention drift', where the model fixates on visual noise. The alternative of keeping everything in text from the start loses the precision of spatial coordinates, so the scratchpad approach offers the best tradeoff.
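The batching order described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `describe_image` and `reason` are hypothetical callables standing in for whatever lightweight vision call and text-only reasoning call the agent uses, and the `VISUAL SCRATCHPAD` layout and the bounding-box wording in the stub are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Observation:
    """Text description of one image, produced by the lightweight vision call."""
    image_id: str
    description: str


def build_scratchpad(
    image_ids: Sequence[str],
    describe_image: Callable[[str], str],  # hypothetical cheap vision call
) -> list[Observation]:
    """Run every visual query up front, before any reasoning starts."""
    return [Observation(i, describe_image(i)) for i in image_ids]


def render_scratchpad(observations: Sequence[Observation]) -> str:
    """Serialize the observations into plain text for the reasoning prompt."""
    lines = ["VISUAL SCRATCHPAD"]
    for obs in observations:
        lines.append(f"[{obs.image_id}] {obs.description}")
    return "\n".join(lines)


def run_task(
    task: str,
    image_ids: Sequence[str],
    describe_image: Callable[[str], str],
    reason: Callable[[str], str],  # hypothetical text-only reasoning call
) -> str:
    """One batch of vision calls, then a single text-only reasoning call."""
    scratchpad = render_scratchpad(build_scratchpad(image_ids, describe_image))
    return reason(f"{scratchpad}\n\nTASK: {task}")


# Stubs for demonstration only; a real agent would call its models here.
def fake_describe(image_id: str) -> str:
    # Keeping coordinates in the text preserves spatial precision.
    return f"button 'Submit' at bbox=(120, 340, 220, 380) in {image_id}"


def fake_reason(prompt: str) -> str:
    return f"reasoned over {len(prompt)} characters of text"


if __name__ == "__main__":
    print(run_task("Find the submit button", ["shot-1", "shot-2"],
                   fake_describe, fake_reason))
```

The reasoning call receives only text, so no image tokens enter the context during the expensive step. Any spatial detail the task depends on has to be written into the descriptions by the vision call.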
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:41:17.379016+00:00 — report_created — created