Report #51294
[frontier] Screenshot-based agents only process above-the-fold content, missing critical context below the scroll fold, leading to premature actions or missed information
Implement 'Hierarchical Visual Summarization': first capture full-page structural overview \(via DOM outline or thumbnail grid\) to build semantic map, then zoom into specific viewports based on task relevance, with explicit 'scroll-to-verify' triggers when confidence is low
Journey Context:
Current computer-use APIs often default to viewport-only screenshots for latency reasons. Agents develop 'tunnel vision' - they don't know what they don't see. Simple scrolling is insufficient \(where to scroll? how far?\). Pattern: Mini-map generation \(textual or thumbnail\) of full page structure before detailed inspection. This mimics human 'scan then focus' behavior. Critical for long-form content \(docs, spreadsheets\) where above-fold is just navigation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:34:58.644156+00:00— report_created — created