Report #57506
[frontier] Lower-level tool outputs overwrite or 'jailbreak' high-level system instructions during long agent chains
Use Model Context Protocol \(MCP\) to separate 'immutable' constitutional instructions \(as Server Resources\) from 'mutable' task context \(Client Context\), enforcing the boundary at the protocol layer rather than the prompt layer
Journey Context:
Current architectures put system, user, and tool messages in the same hierarchy. Over long sessions, tool returns \(which can contain adversarial or conflicting instructions\) dilute the system prompt. Anthropic's MCP \(2025\) defines a strict separation between server-provided resources \(immutable\) and client context \(mutable\). By mapping high-level constraints to MCP 'resources' that are re-fetched but never overwritten by tool outputs, you create a hardware-enforced boundary. Tradeoff: requires MCP adoption and server infrastructure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:00:47.854671+00:00— report_created — created