Report #28774
[frontier] Prompt-based safety constraints are easily bypassed
Implement input/output guardrails as explicit validation steps in the orchestration code (regex, secondary classifiers, schema checks), independent of the LLM's generation.
Journey Context:
System prompts like 'Do not delete files' are soft constraints: prompt injection can override them, and a hallucinating model can ignore them. Code-based guardrails are hard constraints. The orchestration layer intercepts the LLM output, validates it against security policies, and then either sanitizes it or retries the generation. Never trust the LLM to police itself.
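A minimal sketch of such an output guardrail, using only the Python standard library. Everything named here is hypothetical and chosen for illustration: the JSON action format with 'tool' and 'args' keys, the ALLOWED_TOOLS and FORBIDDEN_PATTERNS policy, and the generate callable standing in for the real LLM call. It covers the schema and regex checks with a retry loop; a secondary classifier would slot in as one more check inside validate_action.

```python
import json
import re
from typing import Callable

# Hypothetical policy for illustration: tools the agent may call, and
# patterns that must never appear in any argument value.
ALLOWED_TOOLS = {"read_file", "list_dir"}
FORBIDDEN_PATTERNS = [
    re.compile(r"\brm\s+-[a-z]*r", re.IGNORECASE),  # recursive delete
    re.compile(r"\.\./"),                            # path traversal
]


class GuardrailViolation(Exception):
    """Raised when model output fails a code-level policy check."""


def validate_action(raw_output: str) -> dict:
    """Parse a model-proposed action and enforce the policy in code."""
    try:
        action = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise GuardrailViolation(f"output is not valid JSON: {exc}") from exc

    # Schema check: exactly the assumed shape {"tool": str, "args": dict}.
    if not isinstance(action, dict) or set(action) != {"tool", "args"}:
        raise GuardrailViolation("expected an object with 'tool' and 'args'")
    if not isinstance(action["tool"], str) or not isinstance(action["args"], dict):
        raise GuardrailViolation("'tool' must be a string and 'args' an object")

    # Allowlist check: anything not explicitly permitted is rejected.
    if action["tool"] not in ALLOWED_TOOLS:
        raise GuardrailViolation(f"tool not allowed: {action['tool']!r}")

    # Regex check on every argument value.
    for key, value in action["args"].items():
        if not isinstance(value, str):
            raise GuardrailViolation(f"argument {key!r} must be a string")
        for pattern in FORBIDDEN_PATTERNS:
            if pattern.search(value):
                raise GuardrailViolation(f"argument {key!r} matches a forbidden pattern")
    return action


def run_with_guardrails(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model, validate its output, and retry on violations."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            return validate_action(raw)
        except GuardrailViolation as exc:
            last_error = exc  # discard the output and ask again
    raise GuardrailViolation(
        f"no valid action after {max_attempts} attempts: {last_error}"
    )
```

In this sketch the caller executes only the dictionary returned by run_with_guardrails, never the raw model text. An allowlist is used rather than a blocklist of dangerous tools, so a tool the policy never anticipated is rejected by default.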
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:41:35.315518+00:00— report_created — created