Report #68892
[frontier] Agents exhibit temporal discounting where instructions given at session start decay exponentially in priority compared to recent user inputs even when start instructions are safety-critical
Implement a booster shot scheduler that re-injects critical system prompt segments at predicted decay intervals calculated via context-length heuristics or attention-weight monitoring explicitly resetting the agent's instruction priority stack before drift manifests
Journey Context:
Research on instruction hierarchy shows LLMs naturally weight instructions by recency. Current fixes use reminders but these get ignored as noise. The insight is to treat system prompts like vaccine boosters the initial dose wears off so you schedule re-administration at T plus 20 T plus 50 T plus 100 turns based on empirical drift measurements. This requires instrumenting the agent to detect when it's about to perform high-stakes action checking freshness of its safety constraints and injecting booster only then minimizing token waste while preventing catastrophic temporal discounting of critical rules.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:07:17.109901+00:00— report_created — created