Report #31461
[frontier] Vision-language models misinterpret UI hierarchy when screenshots include system-level overlays (notifications, hover tooltips)
Pre-process screenshots with a UI sanitization layer that detects transient overlays, using drops in OCR confidence and color segmentation, and masks them before VLM ingestion
Journey Context:
Real-world screenshots often contain ephemeral UI elements: OS notifications, hover states, loading spinners, or browser-extension popups. These add visual noise that VLMs interpret as permanent UI elements, leading to incorrect action sequences (e.g., trying to click a notification that is gone by the next frame). Simple cropping fails because overlay positions are unpredictable. The solution is a sanitization pipeline: 1) Run OCR across the image and flag text regions with low confidence or high geometric variance (transient elements are often rendered with different fonts), 2) Segment by color clusters to detect semi-transparent overlays (notifications are often alpha-blended over the content beneath them), 3) Mask the flagged regions with a neutral fill, such as gray or the average UI background color, before sending the image to the VLM. This produces a clean-room screenshot that represents the stable application state.
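A minimal sketch of steps 1 and 3 in Python, under stated assumptions. It does not run OCR itself: it expects word boxes that an OCR engine has already produced, with a 0-100 confidence per box. The `Box` type, the function names `select_transient` and `mask_regions`, and the thresholds (`conf_threshold=60.0`, `height_ratio=1.5`) are hypothetical and would need tuning on real screenshots. "Geometric variance" is approximated here as a box height far from the median height, which is only one possible reading of the report. Step 2 (color segmentation) is left out.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Box:
    """One OCR word box (hypothetical shape; adapt to your OCR engine's output)."""
    left: int
    top: int
    width: int
    height: int
    conf: float  # OCR confidence, assumed to be on a 0-100 scale


def select_transient(boxes, conf_threshold=60.0, height_ratio=1.5):
    """Flag boxes that look transient: low confidence or an outlier text height."""
    if not boxes:
        return []
    med_h = median(b.height for b in boxes)
    flagged = []
    for b in boxes:
        low_conf = b.conf < conf_threshold
        # Height outlier in either direction relative to the median box height.
        odd_size = b.height > med_h * height_ratio or b.height * height_ratio < med_h
        if low_conf or odd_size:
            flagged.append(b)
    return flagged


def mask_regions(image, boxes, fill=(128, 128, 128)):
    """Return a copy of `image` (rows of (r, g, b) tuples) with each box filled."""
    out = [list(row) for row in image]
    height = len(out)
    width = len(out[0]) if height else 0
    for b in boxes:
        # Clamp each box to the image bounds before painting.
        for y in range(max(0, b.top), min(height, b.top + b.height)):
            for x in range(max(0, b.left), min(width, b.left + b.width)):
                out[y][x] = fill
    return out
```

The image is a plain list of pixel rows to keep the sketch dependency-free; in practice an array library would be faster. Masking returns a copy, so the original screenshot stays available for comparison against the sanitized one.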
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:11:37.803146+00:00— report_created — created