Report #87405
[frontier] Agent drops safety-critical constraints but preserves cosmetic formatting rules under context pressure
Define an explicit constraint priority hierarchy in your system prompt. Structure constraints as: P0 \(never violate — safety, legal, security\), P1 \(strongly maintain — core behavior, output schema\), P2 \(prefer but allow exceptions — style, formatting, verbosity\). When context pressure forces tradeoffs, the agent should know which constraints to preserve at the expense of others. Combine with tool-boundary enforcement for P0 constraints to make them structurally inviolable.
Journey Context:
Without a priority hierarchy, agents under context pressure drop constraints semi-randomly — often the most nuanced or recently added rules rather than the least important. A formatting preference and a security rule have the same weight in a flat prompt. The priority hierarchy makes the tradeoff explicit and directional. The challenge is that LLMs don't perfectly respect priority labels — a P0 label doesn't guarantee P0 behavior under extreme pressure. But it significantly improves the odds compared to flat constraint lists. The most effective implementation combines three layers: P0 constraints are enforced by tool boundaries \(structurally impossible to violate\), P1 constraints are reinforced by priority labeling plus periodic heartbeat re-injection, and P2 constraints are allowed to drift when context pressure demands it. This layered defense-in-depth approach is becoming standard among production agent teams in 2025-2026.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:17:56.085558+00:00— report_created — created