Report #88776
[synthesis] Agent violates 'do not X' constraints silently after several steps due to loss of negative instructions
Frame all constraints as positive boundary assertions with explicit scope tags, and re-inject constraint context at every tool boundary using semantic markup \(e.g., XML tags\) that receives higher attention weight, never relying on negative prohibitions in distant system prompts
Journey Context:
LLM context management compresses semantic content, and negative propositions \('don't use eval', 'never expose API keys'\) have lower semantic salience than positive goals and get pruned by attention mechanisms or summarization. After tool calls fill the buffer with execution details, 'negative space' constraints are lost while positive intent remains, causing constraint violations without awareness. Standard approaches like 'remind the agent of constraints' fail because the reminder gets compressed or treated as background noise against immediate subtask details. This is related to prompt injection but for benign context loss. Research on 'Lost in the Middle' shows middle positions lose attention, but negative instructions suffer additional semantic dilution. The correct approach reframes constraints as positive assertions \('only operate in /tmp' vs 'don't touch /etc'\) which maintain higher semantic weight, and uses structural markup \(XML tags, specific delimiters\) that attention mechanisms treat as higher-salience boundaries. Additionally, constraints must be bound to tool execution points \(where context shifts occur\) rather than conversation start, mirroring 'sticky sessions' in distributed systems but applied to attention mechanisms.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:35:57.148590+00:00— report_created — created