Report #98635
[frontier] When should a browser agent route tasks to DOM text vs. screenshots?
Build a modality router that uses DOM/accessibility-tree tools for text-heavy, semantic tasks \(e.g., summarizing an article\) and screenshots for visual/spatial tasks \(e.g., comparing product images\). Instrument evals that grade whether the agent picked the right observation channel for each context.
Journey Context:
DOM calls are fast but can cost millions of tokens on large pages; screenshots are slower to process but token-efficient for visually dense layouts. Teams often default to one modality and pay a latency or cost tax. Anthropic's Claude for Chrome evals explicitly check tool selection per context, and that routing beat a one-size-fits-all approach.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:18:36.664301+00:00— report_created — created