Report #80674
[frontier] Agent loses track of off-screen UI state after scrolling in screenshot-only computer use
Implement hybrid DOM-screenshot tracking by injecting accessibility metadata (element IDs, bounds, interactability) from the browser's accessibility tree into the prompt alongside the screenshot.
Journey Context:
Pure screenshot agents fail after scrolling because they lose visual context for off-screen elements, which leads to repetitive loops or corrupted state. Full DOM parsing adds latency and is brittle when markup or CSS changes. The pragmatic 2025 pattern is 'semantic screenshotting': capture the visual frame and overlay it with computed metadata from the browser's accessibility tree. This lets the agent reason about off-screen state and element semantics without depending on vision alone or on the DOM alone, which addresses the viewport amnesia problem.
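A minimal sketch of the metadata side of this pattern, assuming the accessibility nodes have already been extracted from the browser by some means not shown here. The `AxNode` type, the `visibility` and `build_metadata_block` helpers, and the text layout of the prompt block are all hypothetical names chosen for illustration; the report does not specify a format. The sketch only shows how element IDs, bounds, and interactability can be tagged as visible or off-screen relative to the current scroll position so the agent keeps track of elements it can no longer see.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AxNode:
    """Hypothetical record for one accessibility-tree element (page coordinates)."""
    element_id: str
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int
    interactable: bool


def visibility(node: AxNode, scroll_y: int, viewport_height: int) -> str:
    """Classify a node as 'above', 'below', or 'visible' relative to the viewport."""
    view_top = scroll_y
    view_bottom = scroll_y + viewport_height
    if node.y + node.height <= view_top:
        return "above"
    if node.y >= view_bottom:
        return "below"
    return "visible"  # fully or partially inside the viewport


def build_metadata_block(nodes: List[AxNode], scroll_y: int, viewport_height: int) -> str:
    """Render the text block that would accompany the screenshot in the prompt."""
    lines = [f"viewport: scroll_y={scroll_y} height={viewport_height}"]
    for n in nodes:
        state = visibility(n, scroll_y, viewport_height)
        lines.append(
            f'[{n.element_id}] {n.role} "{n.name}" '
            f"bounds=({n.x},{n.y},{n.width},{n.height}) "
            f"interactable={str(n.interactable).lower()} {state}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    nodes = [
        AxNode("e1", "textbox", "Search", 10, 20, 300, 30, True),
        AxNode("e2", "link", "Details", 10, 700, 120, 20, True),
        AxNode("e3", "button", "Submit", 10, 1800, 100, 40, True),
    ]
    # After scrolling 600px in an 800px-tall viewport, e1 is above and e3 is below.
    print(build_metadata_block(nodes, scroll_y=600, viewport_height=800))
```

Tagging each element with its position relative to the viewport, rather than dropping off-screen elements, is the design choice that matters here: the agent can decide to scroll back up or down toward a known element instead of rediscovering it.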
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T18:00:55.080414+00:00— report_created — created