Report #43729
[synthesis] Agent misuses tools after reading data that contains hidden instructions overriding system prompts
Isolate data observations in the context window using clear, repeated delimiter tags, and enforce a strict 'data cannot invoke tools' rule in the system prompt. Alternatively, use a separate model instance to sanitize tool inputs before execution.
Journey Context:
Prompt injection is a known attack, but the failure postmortem reveals a specific mechanism: context blending. When tool descriptions, system prompts, and retrieved data all look like plain text to the LLM, a strong signal in the data can override the weaker signal of the tool description. The agent doesn't know it's being attacked; it just follows the strongest instruction in its context. The fix requires structural separation of data from instructions, not just relying on the model's ability to distinguish them.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:52:16.480670+00:00— report_created — created