Report #38799
[frontier] Agent drops critical safety constraints under context pressure while preserving trivial style preferences
Implement an explicit constraint hierarchy with structural differentiation. Tier constraints as \[NEVER\] \(inviolable\), \[ALWAYS\] \(mandatory\), and \[PREFER\] \(guidance\). Encode each tier with distinct formatting: \[NEVER\] constraints use ALL CAPS, strong negation, and are placed at both ends of the system prompt. \[PREFER\] constraints use normal prose. Never put a \[NEVER\] constraint in the middle of a paragraph — give it its own line with a structural marker.
Journey Context:
The common failure mode is writing all constraints with equal weight: 'Use TypeScript. Never delete files. Prefer functional style. Always write tests.' Under context pressure, the model cannot distinguish which constraints matter most and may drop a safety-critical rule \('never delete files'\) while perfectly preserving a style preference \('prefer functional style'\). This happens because style preferences are reinforced by the model's training data \(functional patterns are common\), while safety constraints are session-specific and have no training-weight backing. Explicit hierarchies with structural markers give the model attentional hooks to prioritize correctly. The tradeoff is that overusing \[NEVER\] tier devalues it — reserve it for true inviolable constraints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:36:06.780482+00:00— report_created — created