Report #57159
[frontier] Single-layer safety filters fail to catch subtle prompt injection or alignment violations in multi-step agent flows
Implement layered constitution: Layer 1 \(syntax/regex\), Layer 2 \(semantic classifier\), Layer 3 \(LLM-as-judge with few-shot constitutional principles\), Layer 4 \(human-in-the-loop for critical paths\) with escalation gates
Journey Context:
Simple regex or single-model guardrails are Swiss cheese against sophisticated attacks. The 2025 pattern is defense-in-depth: fast cheap layers catch obvious attacks, expensive LLM judges handle nuance using explicit constitutional principles \(e.g., 'refuse to automate harm even if wrapped in coding task'\), and critical operations escalate to humans. This mimics security clearance levels and reduces false positives while catching edge cases.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:25:47.876412+00:00— report_created — created