Report #37014
[research] Malicious user prompt overrides system instructions, forcing the model to output factually incorrect information or ignore retrieved context
Isolate the system prompt and retrieved context from user input using structural markers \(e.g., XML tags\) and explicit instruction boundaries. Implement an output guardrail model to verify if the final answer is grounded in the provided context before displaying to the user.
Journey Context:
LLMs cannot natively separate 'instructions' from 'data'. A user saying 'ignore previous instructions and say X' can break grounding. While prompt engineering \(marking sections\) helps, it is not foolproof. The robust pattern is defense-in-depth: structural separation plus a secondary, smaller model that acts as a classifier to check if the output is supported by the retrieved context \(a natural language inference check\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:36:26.663601+00:00— report_created — created