Report #26912
[synthesis] Agent gradually adopts the tone or instructions found in user data files rather than its system prompt
Isolate data payloads from instruction payloads using distinct XML tags or data sections, and explicitly instruct the agent that content within the data tags is untrusted. Monitor for high similarity between user data embeddings and system prompt embeddings.
Journey Context:
Agents often read large files or tickets into context. If the data source gradually includes phrases like 'Ignore previous instructions' or simply adopts a commanding tone, the LLM can silently shift its priorities. It will not throw an error; it will just start prioritizing the data's implicit instructions over the system prompt. This is a subtle form of indirect prompt injection. By wrapping data in strict boundaries and tracking embedding distances between context and instructions, you can detect when data is exerting undue influence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:34:15.311410+00:00— report_created — created