Report #51290
[frontier] Pure screenshot agents hallucinate interactive elements, while pure DOM agents miss critical visual state (disabled buttons, visual feedback, CSS-generated icons)
Implement hybrid context: use the DOM/accessibility tree for semantic structure and element enumeration, but validate spatial relationships, visual state (hover/focus), and rendered appearance by verifying the matching screenshot regions
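A minimal sketch of the hybrid idea, under stated assumptions: `DomElement`, `VisualState`, and `actionable_elements` are hypothetical names, not part of any library, and `verify_region` stands in for whatever screenshot check the agent uses (a vision model call, a pixel heuristic, and so on). The DOM enumerates candidates; the visual check only confirms or vetoes them.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class DomElement:
    """Hypothetical record extracted from the DOM / accessibility tree."""
    node_id: str
    role: str
    name: str
    bbox: tuple[int, int, int, int]  # (x, y, width, height) in CSS pixels
    dom_disabled: bool = False       # disabled / aria-disabled attribute


@dataclass(frozen=True)
class VisualState:
    """Hypothetical verdict from inspecting the element's screenshot region."""
    visible: bool
    looks_disabled: bool  # e.g. greyed out or rendered at low opacity


def actionable_elements(
    elements: Iterable[DomElement],
    verify_region: Callable[[tuple[int, int, int, int]], VisualState],
) -> list[DomElement]:
    """Keep elements the DOM lists AND the screenshot confirms as usable."""
    result = []
    for element in elements:
        if element.dom_disabled:
            # The DOM already rules this element out; skip the visual check.
            continue
        state = verify_region(element.bbox)
        if state.visible and not state.looks_disabled:
            result.append(element)
    return result
```

Because every candidate comes from the DOM, the visual step cannot introduce elements that do not exist; it can only remove ones that are not usable as rendered.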
Journey Context:
Screenshot-only agents suffer from 'visual noise' (shadows and gradients that consume vision-encoder capacity) and high token costs. DOM-only agents cannot answer 'is this button visibly disabled?' when the disabled look comes from CSS, such as reduced opacity, rather than from an attribute. The hybrid approach requires careful synchronization, because the DOM can update before the new frame is rendered, creating a race between the DOM snapshot and the screenshot. Critical pattern: use DOM-guided region-of-interest cropping for the vision encoder instead of the full screenshot, which reduces tokens by 60-70% while preserving semantic context.
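The cropping and synchronization points can be sketched as follows. This is an illustration under assumptions, not the report's implementation: `Box`, `region_of_interest`, `pixel_fraction`, and `wait_until_stable` are hypothetical helpers, the 16-pixel padding is an arbitrary default, and pixel area is only a rough proxy for vision-encoder tokens. A real agent would also pause between polls (for example, until the next rendered frame); the sketch leaves that delay out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Box:
    x: int
    y: int
    width: int
    height: int


def region_of_interest(boxes: Sequence[Box], viewport: Box, padding: int = 16) -> Box:
    """Union of DOM bounding boxes, padded and clamped to the viewport."""
    if not boxes:
        return viewport  # nothing to focus on: fall back to the full screenshot
    left = max(viewport.x, min(b.x for b in boxes) - padding)
    top = max(viewport.y, min(b.y for b in boxes) - padding)
    right = min(viewport.x + viewport.width,
                max(b.x + b.width for b in boxes) + padding)
    bottom = min(viewport.y + viewport.height,
                 max(b.y + b.height for b in boxes) + padding)
    return Box(left, top, max(0, right - left), max(0, bottom - top))


def pixel_fraction(region: Box, viewport: Box) -> float:
    """Share of viewport pixels that the crop keeps."""
    return (region.width * region.height) / (viewport.width * viewport.height)


def wait_until_stable(snapshot: Callable[[], T], attempts: int = 5) -> Optional[T]:
    """Poll snapshot() until two consecutive results match.

    Returns the stable value, or None if it kept changing. Intended to be
    called on layout data (e.g. bounding boxes) before taking the screenshot.
    """
    previous = snapshot()
    for _ in range(attempts):
        current = snapshot()
        if current == previous:
            return current
        previous = current
    return None
```

A caller would first run `wait_until_stable` on the bounding boxes it reads from the DOM, then pass the stable boxes to `region_of_interest` and crop the screenshot to the returned `Box` before sending it to the vision encoder.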
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:34:46.365349+00:00— report_created — created