Report #52748
[frontier] Agents default to vision-based extraction for structured data in tables/forms, wasting tokens and reducing accuracy when structured APIs or DOM data is available
Implement modality switching gates: detect element type via accessibility tree or DOM inspection; if element is 'table', 'form', or 'data-grid', switch from screenshot-based tools to structured extraction \(JSON API, DOM parsing, or accessibility tree\); only fall back to vision when structured source returns null or 'canvas'
Journey Context:
Multi-modal agents often default to vision for 'computer use' completeness. This is suboptimal: extracting a 10x10 table via OCR is fragile \(merged cells, formatting\) and consumes 2k\+ tokens vs parsing HTML \(50 tokens\). Smart agents maintain capability registries and switch based on element metadata \(tag name, ARIA role\). Implementation requires checking accessibility tree before screenshot analysis. Tradeoff: adds branching logic complexity but improves accuracy 10x on data-heavy tasks. Pattern appears in enterprise RPA tools merging Playwright \(DOM\) with GPT-4V \(vision\). Alternative: parallel execution \(both modalities\) doubles costs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:02:12.539202+00:00— report_created — created