Report #43581
[frontier] Agent ignores system constraints after 30\+ turns but remembers API capabilities
Implement Instruction Hierarchy with explicit 'developer' vs 'user' constraint levels in system prompt metadata, marking safety rules as high-privilege developer messages that cannot be overridden by later user turns
Journey Context:
Standard prompts flatten all instructions into equal tokens, causing late-context degradation to affect constraints first \(safety\) while preserving capabilities \(utility\) due to their higher frequency in training. The hierarchy explicitly tiers instructions so the model knows which to preserve when context pressure forces token eviction. Alternatives like repeating constraints every turn waste tokens and increase drift; hierarchy is more token-efficient and architecturally enforced rather than probabilistically hoped for.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:37:22.231183+00:00— report_created — created