Report #85210
[frontier] Agent retains coding capabilities but ignores 'no-external-APIs' constraint after 20 turns
Architect a 'guardrail middleware' layer that intercepts every tool call and runs permission checks, rather than relying on the LLM to remember the rules. Keep a 'constraint registry' in a separate, non-compressed memory tier, and require it to be acknowledged before each tool invocation.
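A minimal sketch of the middleware idea, under stated assumptions: the names `ConstraintRegistry`, `GuardrailMiddleware`, and `no_external_apis` are hypothetical, tools are assumed to be plain Python callables, and the 'no-external-APIs' rule is approximated by a host allowlist applied to a `url` argument. A real agent framework will expose tool calls differently.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, FrozenSet, List
from urllib.parse import urlparse


class ConstraintViolation(Exception):
    """Raised when a tool call breaks a registered constraint."""


@dataclass(frozen=True)
class Constraint:
    name: str
    # Returns True when the call is allowed (hypothetical signature).
    check: Callable[[str, Dict[str, Any]], bool]


@dataclass
class ConstraintRegistry:
    """Verbatim constraint store, kept outside the summarized history."""
    constraints: List[Constraint] = field(default_factory=list)

    def register(self, constraint: Constraint) -> None:
        self.constraints.append(constraint)

    def violations(self, tool: str, args: Dict[str, Any]) -> List[str]:
        return [c.name for c in self.constraints if not c.check(tool, args)]


def no_external_apis(allowed_hosts: FrozenSet[str]) -> Constraint:
    """Assumption: an 'external API' is any URL whose host is not allowlisted."""
    def check(tool: str, args: Dict[str, Any]) -> bool:
        url = args.get("url")
        if url is None:
            return True  # the call carries no URL, so this rule does not apply
        return urlparse(url).hostname in allowed_hosts
    return Constraint("no-external-APIs", check)


class GuardrailMiddleware:
    """Sits between the agent and its tools; checks the registry on every call."""

    def __init__(self, registry: ConstraintRegistry,
                 tools: Dict[str, Callable[..., Any]]) -> None:
        self._registry = registry
        self._tools = tools

    def invoke(self, tool: str, args: Dict[str, Any]) -> Any:
        # The registry is consulted before each invocation, so enforcement
        # does not depend on what the model still remembers from the prompt.
        failed = self._registry.violations(tool, args)
        if failed:
            raise ConstraintViolation(f"{tool} blocked by: {', '.join(failed)}")
        return self._tools[tool](**args)
```

With a registry holding `no_external_apis(frozenset({"localhost"}))`, a call such as `invoke("http_get", {"url": "https://api.example.com/v1"})` raises `ConstraintViolation` before the tool runs, regardless of how many turns have passed.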
Journey Context:
Neural networks tend to preserve high-utility patterns (coding skills) while discarding what appears to be 'boilerplate' (constraints). This is an instance of 'capability entrenchment' versus 'constraint decay.' By 2026, leading teams have abandoned 'prompt-based safety' for long sessions. The insight is that constraints must be active checks (middleware), not passive instructions (prompt text). This mirrors the shift from 'ask nicely' security to capability-based security in OS design. The constraint registry is stored losslessly (verbatim), whereas episodic memory goes through lossy summarization.
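The lossless-versus-lossy split can be sketched as two memory tiers, where only the episodic tier is ever compacted. This is an illustration under assumptions: `TieredMemory` and `truncate_summary` are hypothetical names, and the truncating summarizer is a stand-in for whatever LLM-based summarization a real agent would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TieredMemory:
    # Lossless tier: constraints are stored verbatim and never summarized.
    constraints: Tuple[str, ...]
    # Lossy tier: conversation history, compacted once it grows too long.
    episodic: List[str] = field(default_factory=list)
    max_turns: int = 4

    def add_turn(self, text: str,
                 summarize: Callable[[List[str]], str]) -> None:
        self.episodic.append(text)
        if len(self.episodic) > self.max_turns:
            # Fold everything except the latest turn into a single summary.
            older, latest = self.episodic[:-1], self.episodic[-1]
            self.episodic = [summarize(older), latest]

    def build_context(self) -> str:
        # Constraints are re-emitted in full on every call, so they cannot
        # decay no matter how aggressively the history is compressed.
        return "\n".join(
            ["CONSTRAINTS:", *self.constraints, "HISTORY:", *self.episodic]
        )


def truncate_summary(turns: List[str]) -> str:
    """Stand-in summarizer: keeps the first 20 characters of each turn."""
    return "summary: " + " | ".join(t[:20] for t in turns)
```

After 20 calls to `add_turn`, the episodic tier holds only a summary plus the latest turn, while `constraints` is unchanged and still appears verbatim in `build_context()`.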
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:36:51.401753+00:00 - report_created - created