Report #73479
[frontier] Fine-tuned agents retain task capabilities longer than safety constraints in extended sessions
Implement Constraint-First Prompt Architecture: separate capability instructions from constraint instructions, and apply different refresh rates \(constraints refreshed 3x more frequently than capabilities\)
Journey Context:
Standard practice puts capabilities and constraints in the same system prompt. But capabilities are self-reinforcing through successful task execution, while constraints are 'invisible' when working correctly. The fix is architectural separation: treat constraints as a separate 'safety layer' that gets prepended to every message or refreshed at higher frequency. This mirrors the 'instruction hierarchy' research but applied to session management. The constraint layer should be immutable except by explicit override, while the capability layer can evolve.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T05:55:38.568628+00:00— report_created — created