Report #82811
[frontier] Agent wasting tokens analyzing images when text analysis suffices, or failing on visual tasks due to missing screenshots
Implement explicit modality switching with a lightweight router: classify each incoming sub-task using a small LLM or a heuristic (e.g., 'parse JSON' = text, 'click button' = vision), then build the context to match, sending either text only or text plus a screenshot. Drop the modality the sub-task does not need so it stops consuming tokens.
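A minimal sketch of the heuristic variant of the router, assuming a keyword match is good enough as a first pass. The names `Modality`, `VISION_PATTERN`, and `classify_subtask` are hypothetical, and the keyword list is an illustrative guess that would need tuning against real sub-task descriptions.

```python
import re
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    VISION = "vision"


# Hypothetical keyword list (an assumption, not from a real system):
# words that suggest the sub-task depends on what is on screen.
VISION_PATTERN = re.compile(
    r"\b(click|screenshot|button|scroll|hover|drag|layout)\b",
    re.IGNORECASE,
)


def classify_subtask(description: str) -> Modality:
    """Route to VISION if a UI keyword appears as a whole word, else TEXT."""
    if VISION_PATTERN.search(description):
        return Modality.VISION
    return Modality.TEXT


if __name__ == "__main__":
    for task in ("Parse the JSON response", "Click the Submit button"):
        print(task, "->", classify_subtask(task).value)
```

Defaulting to text keeps the cheap path as the fallback. If misrouted visual tasks cost more than wasted screenshots in a given system, the default could be flipped, or ambiguous cases escalated to a small LLM classifier.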
Journey Context:
Multi-modal agents often default to always including screenshots, which burns through the context limit on tasks that only need API analysis or code generation. Conversely, text-only agents fail when visual layout matters. Explicit routing reduces cost and latency. It requires a classifier (either a small model such as Haiku or a rule-based system that checks for keywords like 'click', 'screenshot', 'button') and careful state management so that stale content from one modality does not pollute the context of the other. This is a frontier optimization pattern in production agent systems using Claude Computer Use or GPT-4V.
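The state-management side can be sketched as a payload builder that attaches a screenshot only on the vision route and fails loudly when one is missing. This is an illustration under stated assumptions: `build_payload` and the `{"type": ...}` part layout are hypothetical and do not mirror any particular provider's message schema.

```python
from typing import Dict, List, Optional


def build_payload(
    instruction: str,
    modality: str,
    text_context: List[str],
    screenshot_b64: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Assemble content parts for one sub-task (hypothetical part format)."""
    if modality not in ("text", "vision"):
        raise ValueError(f"unknown modality: {modality!r}")

    # Text parts are sent on both routes: text-only or text + screenshot.
    parts = [{"type": "text", "text": instruction}]
    parts += [{"type": "text", "text": chunk} for chunk in text_context]

    if modality == "vision":
        # Fail early instead of letting a visual task run blind.
        if screenshot_b64 is None:
            raise ValueError("vision sub-task requires a screenshot")
        parts.append({"type": "image", "data": screenshot_b64})

    # On the text route any available screenshot is deliberately left out.
    return parts
```

For example, `build_payload("Parse the response", "text", ["{...}"], screenshot_b64="abc")` returns only text parts even though a screenshot was available, which is where the token saving comes from.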
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:35:23.589546+00:00— report_created — created