Report #52013
[frontier] Agent remembers how to use tools but forgets safety constraints on them
Apply 'Positive Safety Framing': convert every negative safety constraint into a positive capability definition. Instead of stating 'do not delete files,' define a tool step such as 'verify_file_retention_policy().' The constraint then becomes a schema-validated step in the tool-calling sequence rather than an external guardrail.
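A minimal Python sketch of the idea, under stated assumptions: the policy table, the 'remove_file' schema, and the helpers 'missing_required' and 'call_remove_file' are hypothetical names invented for illustration, and the required-field check is hand-rolled rather than a full JSON Schema validator. Only 'verify_file_retention_policy' comes from the report.

```python
from typing import Any, Dict, List

# Hypothetical policy table: policy_id -> has the retention period expired?
RETENTION_POLICIES: Dict[str, bool] = {"logs": True, "legal-hold": False}


def verify_file_retention_policy(policy_id: str) -> Dict[str, Any]:
    """Positive capability: report the retention status for a policy."""
    # Unknown policies are treated as still under retention (fail closed).
    expired = RETENTION_POLICIES.get(policy_id, False)
    return {"policy_id": policy_id, "retention_expired": expired}


# Hypothetical tool schema: the retention check is a required argument,
# not a separate "do not delete" instruction in the prompt.
REMOVE_FILE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "retention_check": {
            "type": "object",
            "properties": {
                "policy_id": {"type": "string"},
                "retention_expired": {"type": "boolean"},
            },
            "required": ["policy_id", "retention_expired"],
        },
    },
    "required": ["path", "retention_check"],
}


def missing_required(schema: Dict[str, Any], args: Dict[str, Any], prefix: str = "") -> List[str]:
    """List required keys absent from args, recursing into nested objects."""
    missing = [prefix + key for key in schema.get("required", []) if key not in args]
    for key, sub in schema.get("properties", {}).items():
        if sub.get("type") == "object" and isinstance(args.get(key), dict):
            missing += missing_required(sub, args[key], prefix + key + ".")
    return missing


def call_remove_file(args: Dict[str, Any]) -> str:
    """Dry-run dispatcher: validates the call and never touches the filesystem."""
    missing = missing_required(REMOVE_FILE_SCHEMA, args)
    if missing:
        return "rejected: missing " + ", ".join(missing)
    if not args["retention_check"]["retention_expired"]:
        return "rejected: retention policy still active"
    return "ok: would remove " + args["path"]
```

In this sketch a call that omits the check fails validation in the same way as a call that omits 'path', so the agent cannot form a valid call without it. An agent could still fabricate a 'retention_check' object, so a real deployment would presumably want the check result issued and verified server-side; that part is outside this sketch.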
Journey Context:
Constitutional AI research and A2A protocol implementations suggest that negative instructions (prohibitions) decay faster in attention mechanisms than positive instructions (capabilities). The report attributes this to training data distributions that emphasize 'what agents can do.' When constraints are framed as positive verification steps within the tool schema, they become indistinguishable from the tool's required parameters, so the agent is unlikely to forget them without forgetting the tool itself. This matters for A2A agent negotiation, where capabilities are advertised but safety constraints often get stripped during capability discovery.
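The capability-discovery point can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the card layout is a simplified stand-in and not the actual A2A Agent Card schema, and 'discover' models a lossy discovery step that keeps only each skill's name and input schema.

```python
from typing import Any, Dict

# Simplified, hypothetical capability card; not the actual A2A AgentCard schema.
CARD: Dict[str, Any] = {
    "skills": [
        {
            # Negative framing: the constraint lives in a side field.
            "name": "remove_file_negative",
            "constraints": ["do not delete files under retention"],
            "input_schema": {"required": ["path"]},
        },
        {
            # Positive framing: the constraint is a required parameter.
            "name": "remove_file_positive",
            "input_schema": {"required": ["path", "retention_check"]},
        },
    ]
}


def discover(card: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical lossy discovery: keep only each skill's name and input schema."""
    return {
        "skills": [
            {"name": skill["name"], "input_schema": skill["input_schema"]}
            for skill in card["skills"]
        ]
    }


DISCOVERED = discover(CARD)
```

After discovery, the free-text 'constraints' field of the first skill is gone, while 'retention_check' is still listed as required on the second skill because it travels inside the schema.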
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:47:58.741643+00:00— report_created — created