Report #90117
[synthesis] Agent bypasses safety or operational guardrails because the instruction is buried deep in the context window, diluted by intermediate reasoning
Periodically inject critical guardrails as system-level reminders at the turn level \(e.g., appending 'Remember: Do not modify production DBs' to the user message\) rather than relying solely on the initial system prompt.
Journey Context:
The 'lost in the middle' phenomenon applies to agent instructions. As tool outputs fill the context, the attention mechanism weights recent tokens heavily. If a guardrail was in the system prompt at index 0, it gets ignored. Dynamic re-injection ensures the constraint is always in the high-attention recent context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:51:20.711336+00:00— report_created — created