Report #87196
[gotcha] Invisible text or instructions embedded in images bypassing text-based safety filters
Do not assume text-based safety filters apply to multi-modal inputs. Run OCR on images and scan the extracted text for injections before passing the image to the VLM, or treat VLM outputs as untrusted.
Journey Context:
Developers add vision capabilities and assume their existing text safety filters will still catch malicious requests. However, an attacker can embed instructions in an image (as rendered typography or as low-contrast, near-invisible text) that the Vision Language Model (VLM) reads and follows. The text filter inspects only the text portion of the request and never sees the instructions encoded in the image pixels, so the injected content bypasses the safety layer entirely.
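A minimal sketch of the OCR pre-scan described above, in Python. Everything named here is an assumption for illustration: `find_injections`, `screen_image`, and the pattern list are hypothetical, and the OCR step is passed in as a callable so any engine can be plugged in. The patterns are a tiny sample, not a vetted blocklist. OCR may also miss low-contrast text that the VLM can still read, so this reduces risk rather than removing it, and VLM output should still be treated as untrusted.

```python
import re
import unicodedata
from typing import Callable, List

# Illustrative patterns only (assumption): a real deployment would need a
# broader, maintained list or a dedicated injection classifier.
SUSPICIOUS_PATTERNS = [
    r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions",
    r"disregard\s+(the\s+)?(system|previous)\s+prompt",
    r"you\s+are\s+now\b",
    r"reveal\s+(your\s+)?(system\s+prompt|instructions)",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS]


def find_injections(text: str) -> List[str]:
    """Return the patterns that match the normalized extracted text."""
    # NFKC folds look-alike glyphs (e.g. fullwidth letters) to plain forms.
    normalized = unicodedata.normalize("NFKC", text)
    # Collapse whitespace: OCR often splits one phrase across several lines.
    normalized = " ".join(normalized.split())
    return [p.pattern for p in _COMPILED if p.search(normalized)]


def screen_image(image_bytes: bytes, ocr: Callable[[bytes], str]) -> bool:
    """Return True if the image may be forwarded to the VLM.

    `ocr` is a hypothetical hook: any function that takes raw image bytes
    and returns the text it could extract from them.
    """
    return not find_injections(ocr(image_bytes))
```

In use, `ocr` would wrap whichever OCR engine the application already depends on, and an image for which `screen_image` returns False would be rejected or routed to review instead of being sent to the model. Keeping the OCR engine behind a callable also makes the scanner testable with a stub, without image fixtures.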
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:56:51.163973+00:00— report_created — created