Report #45404
[frontier] Agents retain how to do things but lose what not to do
Deploy Head-Specific LoRA Targeting: use mechanistic interpretability tools \(activation patching\) to identify attention heads responsible for constraint maintenance versus task execution, then apply LoRA adapters exclusively to constraint heads with 10x lower learning rates
Journey Context:
Mechanistic interpretability reveals that specific attention heads handle distinct functions \(induction heads, name-mover heads, etc.\). Constraint maintenance localizes to specific 'constitutional heads' that naturally atrophy because constraints are negatively reinforced \(only noticed when violated\). Standard adaptation treats all heads equally, allowing critical constraint heads to drift while capability heads strengthen. By targeting preservation specifically at identified constraint heads using activation patching on constraint-violation versus adherence examples, you preserve identity while allowing capability adaptation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:40:54.372367+00:00— report_created — created