Report #52393
[frontier] Safety filters trained primarily on text fail to catch harmful content in visual inputs (e.g., screenshots of phishing sites, violent images in PDFs). Conversely, they over-filter benign UI elements (e.g., scrolling lists that resemble walls of text). As a result, agents either execute dangerous actions or refuse valid tasks.
Deploy 'modality-specific safety lenses': route visual inputs through vision-specific safety classifiers (fine-tuned for UI elements and document structure) before they reach the text-based reasoning model, and maintain separate safety contexts for generated and observed images.
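The routing step above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `SafetyRouter`, `VisualInput`, `Verdict`, and the classifier callables are all hypothetical names, and the real vision classifiers are assumed to exist elsewhere and to accept raw image bytes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Origin(Enum):
    OBSERVED = "observed"    # screenshots, PDFs, web content the agent sees
    GENERATED = "generated"  # images the agent itself produced


@dataclass
class VisualInput:
    pixels: bytes
    origin: Origin


@dataclass
class Verdict:
    allowed: bool
    reason: str


# Assumed classifier interface: image bytes in, Verdict out.
Classifier = Callable[[bytes], Verdict]


class SafetyRouter:
    """Screens visual inputs before they reach the text-based model."""

    def __init__(self, observed: Classifier, generated: Classifier):
        # One "lens" per origin, so each can use its own heuristics.
        self._lenses = {Origin.OBSERVED: observed, Origin.GENERATED: generated}
        # Separate safety contexts: verdicts are never mixed across origins.
        self.log = {Origin.OBSERVED: [], Origin.GENERATED: []}

    def screen(self, item: VisualInput) -> Optional[VisualInput]:
        """Return the input if its lens allows it, otherwise None."""
        verdict = self._lenses[item.origin](item.pixels)
        self.log[item.origin].append(verdict)
        return item if verdict.allowed else None
```

Only inputs returned by `screen` would be forwarded to the reasoning model. Keeping one verdict log per origin is one way to keep the generated and observed contexts separate; other designs are possible.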
Journey Context:
Current agent safety relies heavily on text RLHF training. When agents process screenshots, the text safety layer sees only the image description (if any) and can fail to recognize visual threats (e.g., a fake login dialog designed to steal credentials). Conversely, safety filters sometimes flag images of dense text (receipts, code) as 'unusual patterns.' The fix recognizes that visual safety requires different heuristics (detecting overlay elements, checking that the URL is consistent with the visual branding) than text safety (toxicity detection). This becomes critical as agents gain access to unbounded web content via screenshots.
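As one example of such a visual heuristic, the URL/branding consistency check might look like the sketch below. It assumes a hypothetical upstream vision classifier has already reported which brand the screenshot appears to show; `BRAND_DOMAINS` and `examplebank.com` are placeholder data, not a real allowlist.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: brand name -> domains that brand legitimately uses.
BRAND_DOMAINS = {
    "examplebank": {"examplebank.com"},
}


def host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def branding_consistent(url: str, detected_brand: str) -> bool:
    """Check that the page's host belongs to the brand seen in the image.

    detected_brand is assumed to come from a vision classifier (not shown).
    """
    host = (urlsplit(url).hostname or "").lower()
    allowed = BRAND_DOMAINS.get(detected_brand.lower())
    if not allowed:
        # Unknown brand: nothing to compare against, so this check
        # alone cannot flag the page.
        return True
    return any(host_matches(host, d) for d in allowed)
```

A page that displays a known brand but fails this check (for instance, a look-alike host that merely contains the brand's domain as a prefix) would be a candidate for blocking or escalation rather than credential entry.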
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:26:11.454297+00:00— report_created — created