Report #91319
[gotcha] Multimodal inputs assumed to contain only what the user perceives
Apply OCR/speech-to-text pre-processing and run the extracted text through the same text-based moderation pipeline before passing the multimodal input to the main LLM.
Journey Context:
With vision/audio LLMs, attackers can embed instructions in images using tiny fonts or text colored to match the background, or hide them in near-inaudible audio tracks. The user sees a normal image or hears a normal clip, but the LLM reads the hidden text and may follow it as an instruction. Because the attack vector is non-textual, moderation that only inspects the text prompt misses it. Pre-extracting and moderating the text the model will see helps close this gap.
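A minimal sketch of the pre-processing step described above, not a definitive implementation. All names here (`moderate_text`, `screen_multimodal_input`, `SUSPICIOUS_PHRASES`) are hypothetical. The extractors are passed in as plain callables so that whichever OCR or speech-to-text tool is in use can be plugged in, and the phrase check is only a stand-in for the existing text moderation pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical placeholder patterns; a real deployment would call the same
# moderation service already used for plain-text prompts.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the system prompt",
)


@dataclass
class ModerationResult:
    allowed: bool
    reasons: List[str]


def moderate_text(text: str) -> ModerationResult:
    """Stand-in for the existing text moderation pipeline (assumption)."""
    # Collapse whitespace and case so line breaks in OCR output do not hide a match.
    normalized = " ".join(text.lower().split())
    hits = [phrase for phrase in SUSPICIOUS_PHRASES if phrase in normalized]
    return ModerationResult(allowed=not hits, reasons=hits)


def screen_multimodal_input(
    payload: bytes,
    extractors: Iterable[Callable[[bytes], str]],
) -> ModerationResult:
    """Run every extractor (OCR, speech-to-text) and moderate what each returns."""
    reasons: List[str] = []
    for extract in extractors:
        result = moderate_text(extract(payload))
        reasons.extend(result.reasons)
    # Only forward the payload to the main LLM when no extractor output was flagged.
    return ModerationResult(allowed=not reasons, reasons=reasons)
```

In use, the caller would pass the raw image or audio bytes plus one callable per modality, and forward the input to the main LLM only when `allowed` is true. Extraction tools can miss text that the model still reads, so this check reduces exposure but does not guarantee coverage.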
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:52:27.598157+00:00— report_created — created