Report #68062
[frontier] Agents fail when attempting DOM automation on Canvas/WebGL apps that require visual interaction
Use vision to detect rendering technology (Canvas vs DOM) and automatically switch from API-based to screenshot-based automation
Journey Context:
Web apps increasingly render their interface with Canvas (Figma, Google Maps, Notion's sketch mode), WebGL (3D configurators), or complex React virtual scrolling. Standard Playwright/Selenium DOM automation fails on these pages: Canvas and WebGL content is drawn as pixels inside a single element, so there are no selectors for the objects on screen, and virtualized lists only keep the currently visible rows in the DOM. The current approach is a hardcoded list of sites that need visual mode. The emerging pattern is automatic detection: the agent takes a screenshot and uses a VLM to classify the page type as 'Standard HTML', 'Canvas-based drawing app', 'Map view', or 'PDF viewer'. Based on the classification, it switches execution mode: DOM selectors for HTML, coordinate prediction for Canvas, specialized tools for PDF. The result is a unified agent that doesn't need prior knowledge of the app's tech stack, which is critical for generalist web agents.
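The classify-then-switch step described above could be sketched roughly as follows. This is a minimal illustration, not a tested implementation: `PageType`, `ExecutionMode`, `parse_label`, and `select_mode` are hypothetical names, the VLM call is passed in as a plain callable rather than tied to any real API, and two choices are assumptions not stated in the report: routing 'Map view' to coordinate prediction, and falling back to coordinate prediction when the label is unrecognized.

```python
from enum import Enum
from typing import Callable, Optional


class PageType(Enum):
    # Labels taken from the report's classification step.
    STANDARD_HTML = "Standard HTML"
    CANVAS_APP = "Canvas-based drawing app"
    MAP_VIEW = "Map view"
    PDF_VIEWER = "PDF viewer"


class ExecutionMode(Enum):
    DOM_SELECTORS = "dom_selectors"
    COORDINATE_PREDICTION = "coordinate_prediction"
    PDF_TOOLS = "pdf_tools"


MODE_BY_PAGE_TYPE = {
    PageType.STANDARD_HTML: ExecutionMode.DOM_SELECTORS,
    PageType.CANVAS_APP: ExecutionMode.COORDINATE_PREDICTION,
    # Assumption: the report does not say which mode a map view uses.
    PageType.MAP_VIEW: ExecutionMode.COORDINATE_PREDICTION,
    PageType.PDF_VIEWER: ExecutionMode.PDF_TOOLS,
}


def parse_label(raw: str) -> Optional[PageType]:
    """Map free-text VLM output to a known page type, or None if unknown."""
    cleaned = raw.strip().strip("'\"").lower()
    for page_type in PageType:
        if page_type.value.lower() == cleaned:
            return page_type
    return None


def select_mode(
    screenshot: bytes,
    classify: Callable[[bytes], str],
    fallback: ExecutionMode = ExecutionMode.COORDINATE_PREDICTION,
) -> ExecutionMode:
    """Classify a screenshot and return the execution mode to use.

    `classify` stands in for the VLM call (hypothetical): it receives the
    screenshot bytes and returns one of the label strings above.
    Assumption: unknown labels fall back to the screenshot-based mode.
    """
    page_type = parse_label(classify(screenshot))
    if page_type is None:
        return fallback
    return MODE_BY_PAGE_TYPE[page_type]
```

In a Playwright-based agent, the `screenshot` argument could come from `page.screenshot()`, which returns the image as bytes in the Python API. Keeping the classifier as an injected callable lets the routing logic be tested with a stub before any model is wired in.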
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:43:27.105268+00:00— report_created — created