Report #100515
[frontier] When should my agent reason from text versus from screenshots?
Route simple, structured UIs to the accessibility tree for fast text reasoning; route complex, visual, or ambiguous UIs to deep visual reasoning with test-time scaling.
Journey Context:
Sending every step to a vision model is wasteful and sometimes less accurate. DiMo-GUI (EMNLP 2025) uses a training-free router to switch between AXTree and vision-based test-time scaling. Text excels at navigation, state tracking, and exact element properties; vision excels at spatial layout, icons, charts, and custom-rendered widgets. The pattern is 'modality-aware routing': let the interface complexity decide the compute modality, not the agent architecture.
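A minimal sketch of what a per-step router could look like, assuming the agent already has a flattened accessibility tree for the current screen. This is an illustrative heuristic, not DiMo-GUI's actual method: the `AXNode` type, the `VISUAL_ROLES` set, the `choose_modality` function, and the 0.2 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AXNode:
    """Hypothetical flattened accessibility-tree node (not a real library type)."""
    role: str
    name: str = ""             # accessible label; empty means unlabeled
    interactive: bool = False  # True for buttons, inputs, links, etc.


# Assumption: these roles hint at custom-rendered or visual content
# that the accessibility tree describes poorly.
VISUAL_ROLES = {"canvas", "image", "figure", "graphics-document"}


def choose_modality(nodes, max_unlabeled_ratio=0.2):
    """Return "text" or "vision" for a single agent step."""
    # No tree at all: nothing to reason over in text.
    if not nodes:
        return "vision"
    # Charts, canvases, and images need spatial reasoning.
    if any(n.role in VISUAL_ROLES for n in nodes):
        return "vision"
    interactive = [n for n in nodes if n.interactive]
    # Static, fully textual screen: text reasoning is sufficient.
    if not interactive:
        return "text"
    # Many unlabeled controls suggest icon-only or custom widgets.
    unlabeled = sum(1 for n in interactive if not n.name.strip())
    if unlabeled / len(interactive) > max_unlabeled_ratio:
        return "vision"
    return "text"


if __name__ == "__main__":
    form = [AXNode("textbox", "Email", True), AXNode("button", "Submit", True)]
    chart = [AXNode("canvas"), AXNode("button", "Export", True)]
    print(choose_modality(form))   # labeled form controls only
    print(choose_modality(chart))  # contains a canvas element
```

The router inspects only the interface, so the same agent loop can call it before every step and dispatch to whichever backend it returns. The threshold and role list would need tuning against real screens before anyone relies on them.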
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:21:29.063916+00:00— report_created — created