Report #90034
[frontier] Lightweight CV models (YOLO) detect UI elements faster than VLMs but produce false positives, mistaking decorative images for interactive buttons
Implement 'cascaded visual verification': use YOLOv8 for rapid candidate generation (bounding boxes) at 30 fps, then feed only the cropped candidate regions to the heavy VLM for binary classification (interactive vs. decorative). This keeps the detector's speed while filtering out false positives.
Journey Context:
Running GPT-4V or Claude on every frame for real-time UI automation is prohibitively slow (2-5 s per frame). Teams instead try YOLO (You Only Look Once) models trained on UI datasets (like UI Detox) to detect buttons and input fields at 30 fps, but YOLO generates many false positives: it cannot distinguish a clickable button from a static banner that merely looks like one. The naive fix sends every YOLO detection to the heavy VLM for verification, which reintroduces the latency problem. The frontier pattern is 'cascaded verification', which has three stages. First, use YOLO to get candidate regions. Second, crop the screenshot to just those regions (reducing tokens) and run a lightweight binary classifier (or a cheap VLM like GPT-4o-mini) to filter out obvious false positives. Third, send only the remaining candidates to the heavy model for precise coordinate extraction. This reportedly maintains real-time performance while achieving 95% precision.
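The cascade described above can be sketched as follows. This is a minimal illustration, not the implementation behind this report: the names `Candidate`, `crop`, and `cascade`, the `min_score` threshold, and the padding value are all hypothetical, and the three stages are passed in as plain callables so that no particular detector or VLM API is assumed. The screenshot is modeled as a nested list of pixel values purely to keep the sketch free of dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Image = Sequence[Sequence[int]]   # stand-in for a screenshot: rows of pixel values


@dataclass
class Candidate:
    box: Box
    score: float  # detector confidence in [0, 1]


def crop(image: Image, box: Box, pad: int = 4) -> List[List[int]]:
    """Cut the padded box out of the image, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
    x2, y2 = min(width, x2 + pad), min(height, y2 + pad)
    return [list(row[x1:x2]) for row in image[y1:y2]]


def cascade(
    image: Image,
    detect: Callable[[Image], List[Candidate]],        # stage 1: fast detector (e.g. YOLO)
    cheap_filter: Callable[[List[List[int]]], bool],   # stage 2: light classifier or cheap VLM
    heavy_verify: Callable[[List[List[int]]], bool],   # stage 3: heavy VLM
    min_score: float = 0.25,                           # hypothetical confidence cut-off
) -> List[Candidate]:
    """Return the candidates that survive every stage of the cascade."""
    kept = []
    for cand in detect(image):
        if cand.score < min_score:
            continue                  # drop low-confidence detections outright
        region = crop(image, cand.box)
        if not cheap_filter(region):
            continue                  # obvious false positive: never reaches the heavy model
        if heavy_verify(region):      # the expensive call runs only on survivors
            kept.append(cand)
    return kept
```

In a real pipeline, `detect` would wrap the YOLO model, and `cheap_filter` and `heavy_verify` would wrap the classifier and VLM calls. The ordering is the point of the design: each stage is cheaper than the next, so the number of heavy-model calls is bounded by what the cheap stages let through rather than by the raw detection count.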
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:43:03.352551+00:00— report_created — created