Report #52961
[frontier] 2D screenshot agents fail at tasks requiring depth reasoning (e.g., 'click the button behind the modal' or understanding layered UI z-index)
Implement layered depth estimation from single images using monocular depth estimation (e.g., MiDaS) combined with occlusion reasoning to construct z-ordered layer masks before element detection
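A minimal sketch of the layer-mask step, under stated assumptions: the depth map is taken as already computed by some monocular depth model and is represented here as a plain list of rows, with larger values meaning closer to the viewer. The function name split_into_layers and the gap threshold are hypothetical, chosen for illustration only; the report does not specify how layers are separated.

```python
from typing import List

DepthMap = List[List[float]]  # assumed: per-pixel relative depth, larger = closer
Mask = List[List[bool]]


def split_into_layers(depth: DepthMap, gap: float = 0.2) -> List[Mask]:
    """Split a depth map into boolean layer masks, ordered front to back.

    Distinct depth values are sorted from closest to farthest, and a new
    layer starts whenever two consecutive values differ by more than `gap`.
    """
    values = sorted({v for row in depth for v in row}, reverse=True)
    if not values:
        return []

    # Group sorted depth values into bands separated by large jumps.
    bands = [[values[0]]]
    for v in values[1:]:
        if bands[-1][-1] - v > gap:
            bands.append([v])
        else:
            bands[-1].append(v)

    # One mask per band: True where the pixel's depth falls inside the band.
    masks = []
    for band in bands:
        lo, hi = min(band), max(band)
        masks.append([[lo <= v <= hi for v in row] for row in depth])
    return masks


# Toy 4x4 "screenshot": a modal (depth 0.9) floating over a page (depth 0.1).
toy_depth = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
toy_layers = split_into_layers(toy_depth)
```

In this toy input, toy_layers[0] marks the modal pixels and toy_layers[1] marks the page pixels around it. Real depth estimates on screenshots are likely to be noisier than this, so the threshold would need tuning.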
Journey Context:
Vision models process screenshots as flat 2D grids, missing that web pages have z-index stacking, that modals overlay content, and that dropdowns float above their parents. A common mistake is assuming that visual proximity equals DOM proximity. Depth estimation reconstructs the 'scene geometry' of the UI, allowing the agent to understand that a visible button is actually behind a modal overlay and inaccessible, or that a dropdown menu is a separate layer. This prevents 'click-through' errors, where the agent targets an element that is covered by a layer in front of it.
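The occlusion check described above could look roughly like the following sketch. It assumes layer masks are already available, ordered front to back, and that each detected element has been assigned a layer index and a bounding box. The names occluded_fraction and is_clickable and the 0.5 threshold are hypothetical and not part of the report.

```python
from typing import List, Tuple

Mask = List[List[bool]]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), x1 and y1 exclusive


def occluded_fraction(layers: List[Mask], layer_index: int, box: Box) -> float:
    """Fraction of `box` covered by any layer in front of `layer_index`.

    `layers` is ordered front to back, so layers[0] is the topmost layer.
    """
    x0, y0, x1, y1 = box
    total = (x1 - x0) * (y1 - y0)
    if total <= 0:
        return 0.0
    covered = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            # A pixel is occluded if any nearer layer claims it.
            if any(layers[i][y][x] for i in range(layer_index)):
                covered += 1
    return covered / total


def is_clickable(layers: List[Mask], layer_index: int, box: Box,
                 max_occlusion: float = 0.5) -> bool:
    """Treat an element as clickable only if it is mostly uncovered."""
    return occluded_fraction(layers, layer_index, box) <= max_occlusion


# Toy 4x4 scene: a modal covering the center, over a full-size page layer.
modal = [[1 <= x <= 2 and 1 <= y <= 2 for x in range(4)] for y in range(4)]
page = [[True] * 4 for _ in range(4)]
scene = [modal, page]  # front to back

hidden_button = (1, 1, 3, 3)  # on the page layer, fully under the modal
free_button = (0, 0, 2, 1)    # on the page layer, outside the modal
```

With this scene, is_clickable(scene, 1, hidden_button) is False and is_clickable(scene, 1, free_button) is True, so an agent using such a guard would skip the covered button instead of clicking through the modal.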
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:23:29.157842+00:00 - report_created - created