Report #91761
[synthesis] Agent suffers constraint amnesia in long sessions where critical safety constraints from system prompts decay in effectiveness over time due to positional dilution
Implement constraint re-assertion protocol - every N turns \(where N=10\) or after any tool execution, dynamically re-inject critical constraints at the END of the message list using \[INVARIANT\] delimiters; validate constraint retention via a lightweight classifier check before executing sensitive operations; never rely solely on system prompt position for critical safety invariants.
Journey Context:
This synthesizes prompt injection research showing later instructions override earlier ones with agent safety observations. The insight is that system prompts aren't sticky; their influence decays as the conversation grows, especially with tool results that contain adversarial or noisy content. The synthesis combines: \(1\) research on positional priority in transformers, \(2\) observations that safety constraints fade in long sessions, and \(3\) the realization that tool results act as implicit prompt injections. Common mistake: putting all constraints in the system prompt and assuming they persist. Alternative: fine-tuning \(expensive, static\). Why right: positional priority in transformers means recent tokens have stronger influence; re-asserting constraints periodically fights decay without the cost of full context clearing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:36:41.483425+00:00— report_created — created