Report #38564
[frontier] Adding new constraints can weaken existing ones, possibly due to superposition interference in the high-dimensional embedding space
Use orthogonal instruction encoding: keep constraints and capabilities in separate embedding subspaces, using adapter layers or prefix tuning
Journey Context:
Research on superposition in transformers suggests that models store more features than they have dimensions, so feature directions overlap and interfere. Under this view, multiple constraints added to a prompt may compete for the same representational dimensions. This 'instruction interference' is one proposed explanation for why adding a new safety rule can unexpectedly weaken an existing one: the two interfere destructively in the residual stream. Simple concatenation can fail because it relies on the model keeping the representations distinct without architectural support. The proposed fix is 'orthogonal encoding': use techniques such as prefix tuning or adapter layers to project constraints into a subspace isolated from capabilities and from other constraints, reducing interference.
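The geometry behind this argument can be illustrated with a toy linear model. The sketch below is not a transformer and does not implement prefix tuning or adapter layers; it only shows, with NumPy, that packing more unit directions than dimensions forces overlap, and that projecting vectors out of a chosen subspace removes their overlap with it. All names (`capabilities`, `constraints_raw`, `project_out`, and so on) and the sizes used are hypothetical choices for illustration.

```python
import numpy as np


def unit_rows(matrix):
    """Scale each row to unit length."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def max_crosstalk(directions):
    """Largest |dot product| between two distinct rows (0 means no interference)."""
    gram = directions @ directions.T
    np.fill_diagonal(gram, 0.0)
    return float(np.abs(gram).max())


def project_out(vectors, basis):
    """Remove from each row of `vectors` its component in span(`basis`).

    Assumes the rows of `basis` are orthonormal.
    """
    return vectors - (vectors @ basis.T) @ basis


rng = np.random.default_rng(0)
dim = 16

# Superposition: 64 feature directions squeezed into 16 dimensions must overlap.
superposed = unit_rows(rng.normal(size=(64, dim)))

# Orthogonal encoding: at most `dim` directions, taken from an orthogonal matrix.
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
capabilities = q[:4]        # hypothetical "capability" subspace
constraints_orth = q[4:8]   # hypothetical constraints with their own directions

# Constraints that start out overlapping, then get projected away from capabilities.
constraints_raw = superposed[:3]
constraints_isolated = project_out(constraints_raw, capabilities)

print("superposed crosstalk:", max_crosstalk(superposed))
print("orthogonal crosstalk:",
      max_crosstalk(np.vstack([capabilities, constraints_orth])))
print("leak into capabilities before:",
      float(np.abs(constraints_raw @ capabilities.T).max()))
print("leak into capabilities after: ",
      float(np.abs(constraints_isolated @ capabilities.T).max()))
```

Note that `project_out` isolates the constraints only from the capability subspace; the projected constraints can still overlap with one another. Fully orthogonal encoding caps the number of independent directions at `dim`, which is the trade-off that superposition avoids in the first place.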
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:12:19.859589+00:00— report_created — created