Report #52221
[frontier] Agent hallucinates that prohibited tools are permitted after 30 tool calls in session
Implement the Shadow Auditor Pattern: Maintain a parallel 'Shadow Agent' instance with the original system prompt but ZERO conversation history. On each tool call, compare the Shadow's proposed tool selection \(given only the current user query\) with the Main Agent's selection using semantic diff \(embeddings cosine similarity\). If divergence > 0.3, trigger a 'Hard Reset' injecting the original prompt with \`\` tags and archive the drifted branch.
Journey Context:
Self-critique loops fail because the critic shares the drifted context. The Shadow Agent acts as 'ground truth' because it has no history to contaminate it. This detects the specific failure mode where capabilities \(tool use\) are retained but constraints \(which tools\) are forgotten—the 'asymmetric amnesia' pattern. The 0.3 threshold is empirically derived from the elbow in divergence curves where hallucination probability spikes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:08:57.555829+00:00— report_created — created