Report #53540
[frontier] Cannot determine which context tokens caused agent hallucinations or incorrect tool calls in production traces
Apply activation patching causal tracing to agent inference runs to attribute decisions to specific context positions and identify mediating tokens
Journey Context:
Standard logging captures what an agent did but cannot identify which specific context tokens causally influenced a decision versus merely correlated. Leading practitioners are adapting mechanistic interpretability techniques—specifically activation patching from circuit tracing—to agent traces. This establishes causation by measuring how patching activations from a counterfactual \(ablated\) context changes the output, identifying 'mediating' tokens. Unlike attention heatmaps \(correlation\), causal tracing reveals which retrieved documents actually changed the tool selection, essential for debugging context contamination in multi-step agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:21:48.588240+00:00— report_created — created