Report #91902
[frontier] Agent fails on canvas/WebGL applications because it relies on DOM parsing, which sees no elements
Implement bidirectional confidence scoring between the DOM accessibility tree and screenshot pixel analysis: use the DOM for semantic structure when DOM confidence is above 0.8, switch to pure computer-vision mode (coordinate prediction) when it drops below that threshold, and trigger human verification when the two tracks diverge
Journey Context:
DOM-based agents (Playwright, Selenium) break on modern React/Vue apps with obfuscated class names, and fail completely on canvas-based apps (Figma, browser games, WebGL data visualizations), where the interface is drawn as pixels and there are no per-control DOM elements to parse. Screenshot-only agents hallucinate interactions on static images. The frontier pattern is 'confidence fusion', which maintains two parallel tracks. The DOM track supplies the accessibility tree and semantic roles (this is a 'button' with label 'Submit'). The Vision track supplies pixel classification (there is a clickable region at bbox [x,y,w,h] with text 'Submit'). If the DOM says a button exists but Vision sees no button in that bbox (display:none or a stale element), confidence drops and the agent triggers 'stale element' recovery. Conversely, if Vision sees a button but the DOM has no entry for it, the agent treats it as a Canvas/WebGL element and switches to coordinate-based clicking without DOM validation. This prevents both 'invisible element click' (clicking hidden dropdowns) and 'hallucinated element' (clicking decoration). Critical for enterprise agents operating on legacy systems with heavy JavaScript frameworks.
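The decision logic described above could be sketched roughly as follows. This is a minimal illustration, not a tested implementation: the names `DomCandidate`, `VisionCandidate`, `fuse`, and `Action` are hypothetical, the 0.5 IoU cutoff for "same bbox" is an assumption, and the handling of the case where neither track finds anything is a guess. Only the 0.8 DOM confidence threshold and the three recovery paths come from the report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, w, h) in screenshot pixels

DOM_CONFIDENCE_THRESHOLD = 0.8  # threshold stated in the report
IOU_MATCH_THRESHOLD = 0.5       # assumption: minimum overlap to count as "same bbox"


class Action(Enum):
    DOM_CLICK = "dom_click"
    COORDINATE_CLICK = "coordinate_click"
    STALE_RECOVERY = "stale_element_recovery"
    HUMAN_VERIFY = "human_verification"


@dataclass
class DomCandidate:
    """Hypothetical record produced by the DOM track."""
    role: str
    label: str
    bbox: BBox
    confidence: float


@dataclass
class VisionCandidate:
    """Hypothetical record produced by the Vision track."""
    text: str
    bbox: BBox
    confidence: float


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def fuse(dom: Optional[DomCandidate], vision: Optional[VisionCandidate]) -> Action:
    """Pick an action from the two tracks' views of one target element."""
    if dom is None and vision is None:
        # Not covered by the report; escalating is an assumption.
        return Action.HUMAN_VERIFY
    if dom is None:
        # Vision sees a control the DOM lacks: treat as Canvas/WebGL.
        return Action.COORDINATE_CLICK
    if vision is None:
        # DOM claims an element that is not visible (display:none or stale).
        return Action.STALE_RECOVERY
    same_place = iou(dom.bbox, vision.bbox) >= IOU_MATCH_THRESHOLD
    same_label = dom.label.strip().lower() == vision.text.strip().lower()
    if not (same_place and same_label):
        # Both tracks see something, but they disagree: divergence.
        return Action.HUMAN_VERIFY
    if dom.confidence > DOM_CONFIDENCE_THRESHOLD:
        return Action.DOM_CLICK
    return Action.COORDINATE_CLICK
```

For example, `fuse(None, VisionCandidate("Submit", (100, 200, 80, 30), 0.9))` takes the Canvas/WebGL path and returns `Action.COORDINATE_CLICK`. How each track derives its confidence value is left open here, since the report does not specify it.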
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:50:48.060681+00:00— report_created — created