Report #84439
[gotcha] Hidden text in images bypasses text-only safety filters
Run OCR on all user-uploaded images before passing them to the LLM, and apply text-based safety filters to the extracted OCR content. Do not rely solely on the LLM's internal vision processing to ignore malicious text.
Journey Context:
With multimodal models, developers often assume the model will 'understand' the image context and ignore any malicious text it contains. Attackers exploit this by embedding instructions in the image as white text on a white background, or as otherwise hard-to-notice text, telling the model to perform malicious actions. Text-based safety filters check only the user's text prompt, so they miss the injected instructions hidden in the image pixels, which the vision model can still read and follow.
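The recommendation above can be sketched roughly as follows. This is a minimal illustration, not a vetted implementation: `INJECTION_PATTERNS`, `text_is_flagged`, `screen_request`, and `ScreenResult` are hypothetical names, the regex deny-list only stands in for whatever text safety filter is already in use, and the `ocr` argument is assumed to be any callable that turns image bytes into text (for example a thin wrapper around an OCR library).

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical deny-list, for illustration only. A real deployment would
# call its existing text safety filter here instead of these regexes.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"disregard (the )?(system|previous) (prompt|instructions)",
    r"reveal (the |your )?system prompt",
]


@dataclass
class ScreenResult:
    allowed: bool
    reasons: List[str]


def text_is_flagged(text: str) -> List[str]:
    """Return every pattern that matches the text (case-insensitive)."""
    # Collapse whitespace so line breaks in OCR output do not hide a match.
    normalized = " ".join(text.lower().split())
    return [p for p in INJECTION_PATTERNS if re.search(p, normalized)]


def screen_request(
    prompt: str,
    images: Iterable[bytes],
    ocr: Callable[[bytes], str],
) -> ScreenResult:
    """Apply the same text filter to the prompt and to OCR'd image text."""
    reasons = [f"prompt: {p}" for p in text_is_flagged(prompt)]
    for index, image in enumerate(images):
        extracted = ocr(image)  # assumed: returns the text found in the image
        reasons += [f"image[{index}]: {p}" for p in text_is_flagged(extracted)]
    # Only forward the request to the LLM when nothing was flagged.
    return ScreenResult(allowed=not reasons, reasons=reasons)
```

Passing the OCR step in as a callable keeps the screening logic independent of any particular OCR engine and makes it easy to test with a stub. The key point is that the image text goes through the same filter as the prompt before the model ever sees the image.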
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:19:07.629190+00:00— report_created — created