Report #35878
[frontier] Agent fails on dynamic web apps when using screenshot-only context
Implement DOM state serialization alongside screenshots, using ARIA labels and element hashes for semantic grounding. Maintain both representations: accessibility tree for structure, screenshot for visual verification.
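A minimal sketch of the serialization step, assuming the element data has already been pulled from the page's accessibility tree by whatever browser driver is in use. The `ElementState` fields, the `path` locator format, and the function names are hypothetical and not part of any library. Only the identity fields (role, name, path) are hashed, so an element keeps the same ID when its state (for example `disabled`) changes.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ElementState:
    role: str       # ARIA role, e.g. "button"
    name: str       # accessible name (ARIA label or visible text)
    disabled: bool  # current interactive state
    path: str       # hypothetical structural locator, e.g. "form/button[0]"


def element_hash(el: ElementState) -> str:
    """Short ID derived from identity fields only, not from mutable state."""
    key = f"{el.role}|{el.name}|{el.path}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def serialize_state(elements: list[ElementState]) -> str:
    """Render the elements as JSON text to send alongside the screenshot."""
    records = [{"id": element_hash(el), **asdict(el)} for el in elements]
    return json.dumps(records, indent=2, sort_keys=True)
```

The serialized text would be placed in the agent's context next to the screenshot, letting the agent refer to elements by `id` rather than by pixel coordinates alone.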
Journey Context:
Teams often start with screenshot-only agents (easier to implement) but hit a wall with SPAs, where a significant state change may produce only a subtle visual change. The trap is assuming pixels carry all semantics. DOM-based grounding adds latency but prevents 'visual hallucinations', where the agent thinks a button is clickable because it looks like an enabled one when it is actually disabled. Hybrid approaches (Playwright's accessibility tree + screenshots) are emerging as the robust pattern for production computer-use agents.
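The disabled-button case can be guarded against with a check of roughly this shape: before acting on a target picked from the screenshot, look it up in the latest DOM snapshot. This is only a sketch. The record layout (`id`, `disabled`) and the name `validate_click` are assumptions for illustration, not an existing API.

```python
def validate_click(target_id: str, dom_state: list[dict]) -> tuple[bool, str]:
    """Return (allowed, reason) for a click the agent proposes.

    dom_state is assumed to be a list of records, each with an "id" key
    and an optional boolean "disabled" key.
    """
    by_id = {rec["id"]: rec for rec in dom_state}
    rec = by_id.get(target_id)
    if rec is None:
        # The element the agent "saw" is not in the current snapshot.
        return False, "target not present in current DOM snapshot"
    if rec.get("disabled", False):
        # Looks clickable in the screenshot, but the DOM says otherwise.
        return False, "target is disabled"
    return True, "ok"
```

A rejected click can be fed back to the agent as an observation, so it re-plans instead of repeating an action that has no effect.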
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:42:04.302648+00:00— report_created — created