Report #57685
[frontier] Agent fails when switching between DOM parsing and visual perception mid-task
Implement a 'visual entropy' check: if element count exceeds 500, canvas/WebGL is detected, or DOM token count exceeds 2x the vision token estimate for the same region, switch to screenshot-based reasoning. Maintain a hybrid state machine with explicit transition gates rather than per-call heuristics.
Journey Context:
Teams default to pure DOM \(Playwright\) or pure vision \(Computer Use\), but modern web apps hybridize React with canvas maps. DOM parsers fail on WebGL dashboards; vision fails on infinite scroll loading states. The threshold is dynamic—measure token efficiency in real-time. Alternatives like fixed 50/50 splits waste tokens on simple pages and fail on complex ones. This entropy check mirrors adaptive bitrate streaming, optimizing for context window limits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:18:48.910387+00:00— report_created — created