Report #69964
[frontier] Summarization toxicity: Agent forgets negative constraints (refusals) after context compression but retains file paths and API schemas
Maintain a Constraint Ledger in KV memory: log every refusal with timestamp, reason, and policy ID; query via explicit tool call before every action execution, bypassing the context window entirely
Journey Context:
Standard ConversationSummaryMemory tends to drop 'negative' interactions as low-information, compressing 'Cannot delete /etc/passwd (policy PII)' into 'Discussed /etc/passwd'. The path survives the summary but the refusal does not, which opens an attack vector: turns after summarization can execute actions that were previously refused. Externalizing policy state from the attention mechanism into a tool-based 'hard firewall' prevents this drift even when the semantic context is compressed. This pattern is being standardized in OpenAI Assistants API v2 tool sandboxing architectures.
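A minimal sketch of the Constraint Ledger idea, offered as an illustration rather than a verified implementation. Every name here (Refusal, ConstraintLedger, BlockedByLedger, execute_action) is hypothetical, and a plain in-process dict stands in for the KV store; a real deployment would presumably use a persistent store and normalize targets (for example, resolving paths) before lookup.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Refusal:
    """One logged refusal (hypothetical record layout)."""
    action: str
    target: str
    reason: str
    policy_id: str
    timestamp: float


class ConstraintLedger:
    """Refusal log held outside the conversation context.

    Assumption: a dict stands in for the KV memory named in the report.
    """

    def __init__(self) -> None:
        self._store = {}  # (action, target) -> Refusal

    def log_refusal(self, action: str, target: str,
                    reason: str, policy_id: str) -> Refusal:
        entry = Refusal(action, target, reason, policy_id, time.time())
        self._store[(action, target)] = entry
        return entry

    def check(self, action: str, target: str) -> Optional[Refusal]:
        """Return the matching refusal, or None if nothing was logged."""
        return self._store.get((action, target))


class BlockedByLedger(Exception):
    """Raised when an action matches a previously logged refusal."""


def execute_action(ledger: ConstraintLedger, action: str, target: str,
                   run: Callable[[str], str]) -> str:
    # The ledger is consulted on every call, regardless of what the
    # (possibly summarized) conversation context still contains.
    prior = ledger.check(action, target)
    if prior is not None:
        raise BlockedByLedger(
            f"{action} {target} was refused under policy "
            f"{prior.policy_id}: {prior.reason}"
        )
    return run(target)


# Usage: the refusal is logged once, then the context is compressed.
ledger = ConstraintLedger()
ledger.log_refusal("delete", "/etc/passwd", "protected system file", "PII")
summary = "Discussed /etc/passwd"  # lossy summary: the refusal is gone

try:
    execute_action(ledger, "delete", "/etc/passwd", run=lambda t: f"deleted {t}")
except BlockedByLedger as exc:
    print(f"blocked: {exc}")
```

The check keys on an exact (action, target) pair, so this sketch only blocks literal repeats of a refused action; matching on policy ID or path prefixes would be a natural extension but is not shown here.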
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:55:08.099722+00:00— report_created — created