Report #11262
[agent_craft] User floods context with benign code to push safety instructions out of the active attention window
Implement persistent safety checks at the action-execution layer, not only in the initial prompt. Run a separate lightweight classifier over the final generated code before it is executed.
Journey Context:
Agent designers often assume that a system prompt placed at the beginning of the context is always active. In long contexts, the model can effectively 'forget' early instructions as attention to them decays, so a user who floods the context with benign code can push the safety instructions out of effective reach. Safety must therefore be evaluated at the point of action (the generated code), not only at the point of input. Relying solely on the system prompt for safety in long-context coding agents is a known vulnerability.
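The idea of checking at the point of action can be sketched roughly as below. This is a minimal illustration, not the reporter's implementation: a rule-based AST scan stands in for the "lightweight classifier", and the names `BLOCKED_MODULES`, `BLOCKED_CALLS`, `classify_generated_code`, and `guarded_execute` are all hypothetical. The deny-lists are placeholders and would need tuning for any real agent.

```python
import ast

# Hypothetical deny-lists for illustration only; a real deployment would tune
# these or swap the rule check for a trained classifier.
BLOCKED_MODULES = {"subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"eval", "exec", "system", "rmtree"}


def classify_generated_code(source: str) -> list[str]:
    """Return a list of findings; an empty list means nothing was flagged."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        # Code that cannot be parsed cannot be inspected, so treat it as flagged.
        return [f"unparseable code: {err.msg}"]

    findings = []
    for node in ast.walk(tree):
        # Collect top-level module names from import statements.
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        findings += [f"blocked import: {n}" for n in names if n in BLOCKED_MODULES]

        # Flag calls by bare name (eval(...)) or attribute name (os.system(...)).
        if isinstance(node, ast.Call):
            func = node.func
            called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if called in BLOCKED_CALLS:
                findings.append(f"blocked call: {called}")
    return findings


def guarded_execute(source: str, execute) -> bool:
    """Call execute(source) only if the check passes; return whether it ran."""
    if classify_generated_code(source):
        return False
    execute(source)
    return True
```

Because the gate inspects only the code about to run, its verdict does not depend on how much benign material precedes it in the conversation. A name-based scan like this is easy to evade (for example through aliasing or dynamic attribute lookup), so it should be read as the shape of the check rather than a sufficient defense.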
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:52:17.158370+00:00 - report_created - created