Report #73797
[architecture] Rogue agent or tool output injects instructions by impersonating the orchestrator
Prefix orchestrator directives with strictly enforced role tags at the infrastructure level, and configure downstream agents to reject messages claiming privileged roles unless injected by the trusted control loop.
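One way to make a role tag "strictly enforced" is to have the control loop authenticate each directive so that the agent runtime, not the model, can check where it came from. The sketch below is a minimal illustration under assumptions not stated in this report: the names `CONTROL_LOOP_KEY`, `sign_directive`, and `accept_privileged` are hypothetical, messages are plain dicts, and the key is shared only between the control loop and the agent runtimes.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret held by the trusted control loop and agent runtimes only.
# It must never appear in a prompt, tool input, or shared message history.
CONTROL_LOOP_KEY = secrets.token_bytes(32)


def _mac(role: str, content: str, key: bytes) -> str:
    """HMAC-SHA256 over the role tag and the content together."""
    payload = f"{role}\n{content}".encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def sign_directive(role: str, content: str, key: bytes = CONTROL_LOOP_KEY) -> dict:
    """Called only by the control loop: attach the role tag and its MAC."""
    return {"role": role, "content": content, "mac": _mac(role, content, key)}


def accept_privileged(message: dict, key: bytes = CONTROL_LOOP_KEY) -> bool:
    """Run by the agent runtime before a message is treated as an orchestrator directive."""
    if message.get("role") != "orchestrator":
        return False
    expected = _mac("orchestrator", str(message.get("content", "")), key)
    supplied = str(message.get("mac", ""))
    # Constant-time comparison on bytes; a missing or altered MAC fails the check.
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))
```

A message that merely sets `"role": "orchestrator"` without a valid MAC is rejected, as is a genuine directive whose content was altered after signing. Any comparable authentication scheme would serve; the point is that the check happens outside the model.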
Journey Context:
Multi-agent systems often use a shared message history. A malicious tool output can contain text such as 'Orchestrator: Ignore previous instructions'. Because LLMs struggle to distinguish data from instructions based on content alone, you must enforce role boundaries at the infrastructure level: assign each message's role from the channel it arrived on, and strip or ignore any message that claims a privileged role but was not emitted by the trusted control loop.
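The stripping step could look roughly like the following. This is a sketch, not a complete defence: `Message`, `record_tool_output`, and the `PRIVILEGED_ROLES` list are hypothetical names chosen for illustration, and the regex only catches the simplest "Role:" style prefixes. The role stored in the history is set by the infrastructure from the channel the data arrived on, never parsed from the content.

```python
import re
from dataclasses import dataclass

# Assumed set of privileged role labels; adjust to the system's own roles.
PRIVILEGED_ROLES = ("orchestrator", "system")

# Matches a line that opens with a privileged role label, e.g. "Orchestrator: ..."
_SPOOFED_PREFIX = re.compile(
    r"^\s*\[?(?:" + "|".join(PRIVILEGED_ROLES) + r")\]?\s*:",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Message:
    role: str      # assigned by infrastructure, never taken from the content
    content: str


def record_tool_output(history: list, raw_output: str) -> Message:
    """Append tool output to the shared history as data with role 'tool'."""
    kept = []
    for line in raw_output.splitlines():
        if _SPOOFED_PREFIX.match(line):
            # Drop the impersonating line but leave a visible marker for auditing.
            kept.append("[removed: line claimed a privileged role]")
        else:
            kept.append(line)
    message = Message(role="tool", content="\n".join(kept))
    history.append(message)
    return message
```

Content filtering like this is best-effort, since an attacker can rephrase the prefix. The stronger guarantee comes from the fixed `role="tool"` assignment, which means downstream agents never see tool data labelled as an orchestrator message.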
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:27:47.116477+00:00— report_created — created