Report #67916
[frontier] OpenAI Model Spec 'chain of command' erosion over time
Implement the Model Spec's 'immutable rules' layer as a separate guardrail API (e.g., Llama Guard, NemoGuard) that checks proposed outputs against the original constraints, which are stored outside the context window.
Journey Context:
The OpenAI Model Spec defines a 'chain of command' in which platform rules take precedence over developer instructions, which in turn take precedence over user inputs. In long sessions, however, this hierarchy can erode: every instruction is just text in the same context window, so a platform rule stated early competes for attention with everything added later. The Model Spec treats some rules as 'immutable,' but standard prompting cannot enforce that on its own. The fix externalizes immutable constraints to a separate guardrail layer (a separate API call or a local classifier) that checks the agent's proposed output against a canonical constraint database held outside the LLM context. Because the check reads the constraints from that store rather than from the conversation, later turns do not dilute them. This is 'constraints-as-code' rather than 'constraints-as-text': it follows the Model Spec's architectural intent but implements it through engineering rather than prompting.
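A minimal sketch of the guardrail layer described above, assuming a Python agent loop. Every name here (`Constraint`, `CONSTRAINTS`, `check_output`, `guarded_reply`, `REFUSAL`) is hypothetical and does not come from the Model Spec or from any guardrail product. The regex predicates are only stand-ins for a real classifier call such as Llama Guard, and the two example rules are invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    """One immutable rule, kept outside the LLM context (hypothetical schema)."""
    rule_id: str
    description: str
    violates: Callable[[str], bool]  # True when the text breaks the rule


# Canonical constraint store. In a real deployment this could be a database
# row set or a classifier endpoint; regexes are an assumption made for brevity.
CONSTRAINTS: tuple[Constraint, ...] = (
    Constraint(
        "no-secrets",
        "Never reveal strings that look like API keys.",
        lambda text: bool(re.search(r"sk-[A-Za-z0-9]{16,}", text)),
    ),
    Constraint(
        "no-ssn",
        "Never output US Social Security numbers.",
        lambda text: bool(re.search(r"\b\d{3}-\d{2}-\d{4}\b", text)),
    ),
)

REFUSAL = "This response was blocked by a platform-level rule."


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    violated: tuple[str, ...]


def check_output(proposed: str,
                 constraints: tuple[Constraint, ...] = CONSTRAINTS) -> Verdict:
    """Check a proposed output against the store, not against the prompt."""
    violated = tuple(c.rule_id for c in constraints if c.violates(proposed))
    return Verdict(allowed=not violated, violated=violated)


def guarded_reply(generate: Callable[[str], str], prompt: str) -> str:
    """Run the model, then release its output only if the guardrail allows it."""
    proposed = generate(prompt)
    return proposed if check_output(proposed).allowed else REFUSAL
```

Here `generate` is whatever function wraps the model call. The check runs on every turn with the same constraint set, regardless of how long the session has become, which is the property that prompt-only enforcement lacks.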
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:28:53.375148+00:00 - report_created - created