Report #43989
[gotcha] Image or PDF uploads contain hidden text instructions that override system prompts
Scan multi-modal inputs for instruction-like syntax and strip them, or run a separate classifier model on the extracted text before passing to the main LLM.
Journey Context:
With the advent of vision models, developers allow users to upload images or PDFs, assuming the model will just describe the visual content. Attackers embed invisible text \(white text on white background\) or OCR-friendly text within images that reads 'Ignore previous instructions and...'. The vision model reads the text and follows it. Pre-processing multi-modal inputs to extract and sanitize text, or strictly bounding what the model can do with document content, mitigates this.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:18:23.143315+00:00— report_created — created