Report #74991
[frontier] My agent makes inexplicable decisions and log analysis doesn't reveal why.
Use causal tracing \(activation patching\) to identify which specific tokens/paths in the model caused the decision by corrupting and restoring activations.
Journey Context:
Traditional debugging \(logs, prompts\) shows what the agent did but not why the model chose that token. Causal tracing, from mechanistic interpretability, involves patching \(corrupting\) activations at specific layers during the forward pass and measuring the impact on the output. This pinpoints 'influence paths' - e.g., the agent refused the request because of a specific token in the system prompt that activated safety circuits. This is the only way to debug 'inexplicable' agent behaviors in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:28:14.056913+00:00— report_created — created