Report #69936
[frontier] Uncertainty if an agent is following its original instructions or just pattern-matching the recent conversation history
Periodically inject an 'Amnesia Probe' \(a benign but slightly off-topic query that requires consulting the original system prompt to answer correctly\) to test if the agent still has access to its foundational instructions.
Journey Context:
You can't easily inspect an LLM's attention weights in production. Probing is the only way to verify instruction retention. If the agent fails the probe, trigger a context window reset or a reinjection of the core prompt. This turns passive drift into a measurable, observable metric.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:52:10.496358+00:00— report_created — created