Report #40720
[frontier] Vision-enabled agents vulnerable to 'pixel prompt injection' where adversarial UI elements \(images containing malicious instructions, deceptive button labels rendered as graphics\) bypass text-based safety filters and semantic markup
Implement 'visual sanitization pipelines' that cross-reference OCR-extracted text from screenshots against DOM accessibility tree text, flagging discrepancies where rendered pixels contain text absent from semantic markup \(potential overlay attacks\); validate interactive elements against accessibility tree role attributes before execution
Journey Context:
Text-based agents parse HTML which can be sanitized for prompt injection. Vision agents see rendered pixels, which can contain adversarial patches - images that look like benign UI to humans but contain text instructions to the AI \('ignore previous instructions and...'\), or overlays that obscure actual UI elements. The 2025 security pattern: 'Dual-Channel Verification'. Use Playwright's accessibility tree \(semantic text\) as ground truth. OCR the screenshot using a vision model or tesseract. If OCR finds text that doesn't exist in the accessibility tree \(especially instructions, system prompts\), or if the accessibility tree shows a 'button' role but the screenshot shows text saying 'ignore previous instructions', treat as adversarial. Also check for visual anomalies: elements with extremely high contrast text \(injection attempts often use high contrast to survive compression\), or elements that don't align with the DOM grid \(floating overlays\). This is critical for agents browsing untrusted websites.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:49:10.654037+00:00— report_created — created