Report #59927
[frontier] Prompt-based safety constraints are bypassed or inconsistent across agent tools
Replace prompt-level guardrails with explicit Policy-as-Code using Open Policy Agent (OPA) and the Rego language, evaluating policies against structured agent state (JSON intent objects) rather than parsing natural language
Journey Context:
System prompt instructions ('never delete files') are fragile: agents ignore them in long contexts or when jailbroken. A more robust pattern extracts policies into version-controlled, testable Rego code that evaluates structured data (the agent's intended action as a JSON object with fields like 'action_type', 'target_resource', 'risk_level') rather than parsing generated text. This enables composition of complex policies (RBAC + ABAC + rate limiting), unit testing of guardrails, and audit trails. Critical distinction: the policy evaluates the *structured intent* before execution (input validation) and the *structured result* after execution (output validation), not the natural language. This pattern is emerging in enterprise agent platforms that need compliance guarantees.
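The two checkpoints described above (structured intent before execution, structured result after) can be sketched as follows. This is a plain-Python stand-in for the deny-rule style a Rego policy would use; it is not OPA and does not call any OPA API. The field names 'action_type', 'target_resource', and 'risk_level' come from the report. Everything else (the allowlist, the workspace root, the 'bytes_written' result field, the size limit, and all function names) is a hypothetical assumption for illustration.

```python
import posixpath

# Hypothetical policy data; in an OPA setup this would live in version control
# next to the Rego rules.
ALLOWED_ACTIONS = {"read_file", "write_file"}
WORKSPACE_ROOT = "/workspace"
MAX_BYTES_WRITTEN = 1_000_000


def action_not_allowlisted(intent):
    """Deny any action type that is not explicitly allowed (default deny)."""
    action = intent.get("action_type")
    if action not in ALLOWED_ACTIONS:
        return f"action_type {action!r} is not allowlisted"
    return None


def target_outside_workspace(intent):
    """Deny targets that resolve outside the workspace, including '..' paths."""
    target = posixpath.normpath(str(intent.get("target_resource", "")))
    inside = target == WORKSPACE_ROOT or target.startswith(WORKSPACE_ROOT + "/")
    if not inside:
        return f"target_resource {target!r} is outside {WORKSPACE_ROOT}"
    return None


def high_risk_needs_approval(intent):
    """Deny high-risk actions other than reads."""
    if intent.get("risk_level") == "high" and intent.get("action_type") != "read_file":
        return "high-risk non-read actions need human approval"
    return None


def wrote_too_much(result):
    """Post-execution check on the structured result, not on generated text."""
    if result.get("bytes_written", 0) > MAX_BYTES_WRITTEN:
        return f"bytes_written exceeds {MAX_BYTES_WRITTEN}"
    return None


INTENT_RULES = [action_not_allowlisted, target_outside_workspace, high_risk_needs_approval]
RESULT_RULES = [wrote_too_much]


def evaluate(document, rules):
    """Run every rule and collect violations; allow only if there are none."""
    violations = [msg for rule in rules if (msg := rule(document)) is not None]
    return {"allow": not violations, "violations": violations}


if __name__ == "__main__":
    intent = {
        "action_type": "delete_file",
        "target_resource": "/workspace/../etc/passwd",
        "risk_level": "high",
    }
    decision = evaluate(intent, INTENT_RULES)
    # The decision dict can be logged as-is to form an audit trail.
    print(decision)
```

Because each rule is a pure function over a JSON-like dict, guardrails can be unit tested without running the agent, and new rule families (role checks, rate limits) compose by appending to the rule lists. A Rego version would express the same rules declaratively and be evaluated by OPA rather than by this hypothetical 'evaluate' helper.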
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T07:04:32.057999+00:00— report_created — created