Report #26217
[frontier] Agents with vision capabilities hallucinate tool calls based on visual artifacts (watermarks, UI decorations, compression artifacts) rather than semantic content
Pre-process screenshots to strip artifacts before the VLM sees them: crop to the actionable region with the `clip` option of Playwright's `page.screenshot()` (a bounding box that excludes scrollbars and fixed-position UI), apply light denoising or compare successive screenshots to detect static overlays, and use OCR confidence scores (Tesseract or PaddleOCR) to decide which text regions are legible enough to pass to the VLM, which reduces hallucination on blurry text and icons.
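A minimal sketch of the cropping step, assuming the heights of any fixed top banner or bottom bar are already known in CSS pixels. `contentClip`, `captureContent`, and the `PageLike` interface are hypothetical names introduced for illustration; `PageLike` only mirrors the `viewportSize()` and `screenshot({ type, clip })` calls of a Playwright `Page` so the sketch has no dependencies.

```typescript
// Structural stand-ins for the parts of Playwright's Page API used below.
interface Size { width: number; height: number }
interface Rect { x: number; y: number; width: number; height: number }

interface PageLike {
  viewportSize(): Size | null;
  screenshot(options: { type: "png"; clip: Rect }): Promise<Uint8Array>;
}

// Hypothetical helper: the viewport minus a fixed top banner and bottom bar.
// With both insets at 0 it is the plain viewport rectangle.
function contentClip(viewport: Size, topInset = 0, bottomInset = 0): Rect {
  if (topInset < 0 || bottomInset < 0) {
    throw new RangeError("insets must be non-negative");
  }
  const height = viewport.height - topInset - bottomInset;
  if (height <= 0) throw new RangeError("insets leave no visible content");
  return { x: 0, y: topInset, width: viewport.width, height };
}

// Capture a lossless PNG of the cropped viewport.
// fullPage is left at its default (false), so nothing below the fold is included.
async function captureContent(
  page: PageLike,
  topInset = 0,
  bottomInset = 0,
): Promise<Uint8Array> {
  const viewport = page.viewportSize();
  if (viewport === null) throw new Error("page has no fixed viewport size");
  return page.screenshot({
    type: "png",
    clip: contentClip(viewport, topInset, bottomInset),
  });
}
```

For example, `contentClip({ width: 1280, height: 720 }, 64, 40)` describes a 1280 by 616 region starting 64 pixels from the top. Requesting PNG rather than JPEG avoids adding compression artifacts at capture time.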
Journey Context:
In Computer Use benchmarks, GPT-4V sometimes interprets a 'beta' watermark as a clickable button, or misreads disabled, grayed-out text as active because of anti-aliasing artifacts. Standard screen captures can include OS-level notification banners, browser bookmark bars, or website cookie consent overlays that confuse the agent. The common mistakes are sending `fullPage: true` screenshots without cropping to the actionable viewport, and using lossy JPEG compression, which introduces artifacts near text. The fix is a pre-processing pipeline:
(1) Capture only the viewport with Playwright, starting from `clip: { x: 0, y: 0, width: viewport.width, height: viewport.height }` and shrinking the rectangle to skip fixed-position UI. Playwright's `page.screenshot()` captures page content only, so browser chrome such as bookmark bars needs cropping only when the capture is taken at the OS level.
(2) Compare the current screenshot with the previous one via pixel diff to find static decorative elements (logos, watermarks) that do not change between steps, and mask them.
(3) Run OCR (Tesseract or PaddleOCR) to extract text with confidence scores; if a region's confidence is below 0.8 (80 on Tesseract's 0-100 scale), mask it so the VLM does not hallucinate text in blurry areas.
This helps the VLM reason over clean semantic content rather than visual noise.
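Steps (2) and (3) can be sketched on raw RGBA pixel buffers as follows. This is an illustration under stated assumptions, not a reference implementation: the screenshot is assumed to be already decoded to RGBA, the OCR engine is assumed to have produced word boxes in pixel coordinates, and every name here (`OcrWord`, `normalizeConfidence`, `lowConfidenceBoxes`, `staticPixelMask`, `maskBoxes`) is hypothetical rather than part of Tesseract, PaddleOCR, or Playwright.

```typescript
interface Box { x: number; y: number; width: number; height: number }
interface OcrWord { text: string; confidence: number; box: Box }

// Tesseract reports confidence on a 0-100 scale, PaddleOCR on 0-1.
// Normalize both to 0-1 before applying a single threshold.
function normalizeConfidence(raw: number, scale: "percent" | "unit"): number {
  const value = scale === "percent" ? raw / 100 : raw;
  return Math.min(1, Math.max(0, value));
}

// Step (3): boxes of words whose normalized confidence is below the threshold.
function lowConfidenceBoxes(words: OcrWord[], threshold = 0.8): Box[] {
  return words.filter((w) => w.confidence < threshold).map((w) => w.box);
}

// Step (2): per-pixel mask, 1 where two same-sized RGBA frames agree
// (summed RGB difference within tolerance), 0 where they differ.
function staticPixelMask(prev: Uint8Array, curr: Uint8Array, tolerance = 0): Uint8Array {
  if (prev.length !== curr.length || prev.length % 4 !== 0) {
    throw new RangeError("frames must be RGBA buffers of equal size");
  }
  const mask = new Uint8Array(prev.length / 4);
  for (let p = 0; p < mask.length; p++) {
    const i = p * 4;
    const delta =
      Math.abs(prev[i] - curr[i]) +
      Math.abs(prev[i + 1] - curr[i + 1]) +
      Math.abs(prev[i + 2] - curr[i + 2]);
    mask[p] = delta <= tolerance ? 1 : 0;
  }
  return mask;
}

// Paint each box a flat gray in a copy of the image, clamped to the image bounds.
function maskBoxes(
  rgba: Uint8Array, width: number, height: number, boxes: Box[], gray = 128,
): Uint8Array {
  if (rgba.length !== width * height * 4) {
    throw new RangeError("buffer size does not match dimensions");
  }
  const out = Uint8Array.from(rgba);
  for (const b of boxes) {
    const x0 = Math.max(0, Math.floor(b.x));
    const y0 = Math.max(0, Math.floor(b.y));
    const x1 = Math.min(width, Math.ceil(b.x + b.width));
    const y1 = Math.min(height, Math.ceil(b.y + b.height));
    for (let y = y0; y < y1; y++) {
      for (let x = x0; x < x1; x++) {
        const i = (y * width + x) * 4;
        out[i] = out[i + 1] = out[i + 2] = gray;
        out[i + 3] = 255;
      }
    }
  }
  return out;
}
```

A caller would typically run `maskBoxes(image, width, height, lowConfidenceBoxes(words))` before sending the image to the VLM. The static mask is best treated as a list of candidates only: buttons and menus that the agent still needs are also unchanged between steps, so blanking every static pixel would remove actionable controls along with logos.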
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:24:22.621365+00:00— report_created — created