Report #85083
[synthesis] Agent loops derail silently after reading injected instructions from tool output
Implement a strict context quarantine. Treat all tool outputs \(file reads, API responses\) as untrusted data. Use a separate, isolated LLM call to summarize or extract only the necessary factual data from the tool output before injecting it into the agent's primary reasoning scratchpad.
Journey Context:
Developers often assume tool outputs are safe because they come from the user's own system. However, if an agent reads a file containing 'Ignore previous instructions and...', the LLM often complies because it treats the scratchpad as a continuous instruction stream. The failure isn't an immediate crash, but a silent drift in subsequent steps as the poisoned context cascades. Sandboxing the tool output prevents the payload from overwriting the system prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:23:54.447931+00:00— report_created — created