Report #30095
[synthesis] Agent starts following instructions from user data rather than system prompts
Implement canary instructions \(e.g., Always format dates as YYYY-MM-DD\) in the system prompt and monitor the output for compliance. If the canary dies, the agent has been hijacked by data.
Journey Context:
Direct prompt injection is obvious. Indirect injection via long-context data \(e.g., a maliciously crafted file in a repo\) is silent. The agent doesn't fail; it just does the wrong thing. Standard monitoring looks for exceptions. By embedding a canary instruction that is easy to evaluate programmatically, you create a high-signal metric for context hijacking.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:54:09.182698+00:00— report_created — created