Report #34980
[frontier] Agent safety and output validation via system prompt rules alone
Implement guardrails as programmatic middleware layers that intercept and validate inputs and outputs before and after LLM calls. Use schema validation, regex checks, classifier models, and allowlists or denylists as executable code in a middleware stack, not as prompt text. Layer fast deterministic checks first, then nuanced LLM-based validators for subjective quality.
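As a rough illustration of the deterministic checks named above (schema validation, a regex check, a denylist), the sketch below validates a model's raw output in plain code. The field names, the email pattern, and the denylist terms are hypothetical placeholders chosen for the example, not part of the original report. A real deployment would probably use a full JSON Schema validator and a dedicated PII detector.

```python
import json
import re

# Hypothetical policy values, chosen only for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DENYLIST = {"password", "ssn"}
REQUIRED_FIELDS = {"answer": str, "sources": list}


def validate_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passed."""
    violations = []

    # Format enforcement: the output must be a JSON object with the expected fields.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            violations.append(
                f"field '{field}' missing or not {expected_type.__name__}"
            )

    # Content checks run on the raw text so that no field is skipped.
    if EMAIL_RE.search(raw):
        violations.append("possible email address (PII)")
    lowered = raw.lower()
    for term in sorted(DENYLIST):
        if term in lowered:
            violations.append(f"denylisted term: {term}")

    return violations
```

Because the function returns reasons instead of raising, the caller can decide whether to retry the LLM call, redact the output, or reject the request outright.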
Journey Context:
The instinct is to add safety rules to the system prompt: 'Never output PII', 'Only answer questions about X', 'Always format output as JSON'. This is unreliable because LLMs can ignore or misinterpret prompt instructions, especially under adversarial inputs, prompt injection, or edge cases the prompt author did not anticipate. A system prompt rule is a suggestion to the LLM; a middleware guardrail is an enforced constraint. The emerging pattern is guardrails as middleware: programmatic functions that run before the LLM call (input validation, PII detection, topic classification, injection detection) and after the LLM call (output schema validation, content filtering, format enforcement, factuality checks). This is the same pattern as middleware in web servers: concerns like authentication, logging, and rate limiting do not belong in business logic; they belong in the middleware stack. The tradeoff is that middleware adds latency (each check takes time) and can over-block (false positives reject valid inputs or outputs). The best implementations use a layered approach: fast deterministic checks (regex, JSON schema validation, keyword denylists) run first and catch the majority of issues with minimal latency. Slower, nuanced checks (a secondary LLM call to judge whether a response is helpful and accurate) run only on outputs that pass the fast checks. This hybrid approach gets the reliability of code-level enforcement with the nuance of LLM-based judgment where it matters.
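The layered ordering described above can be sketched as a small wrapper around the LLM call. This is a minimal sketch under stated assumptions, not a definitive implementation: `guarded_call`, `Result`, the naive injection pattern, and the stub LLM and judge are all hypothetical names invented for the example. In practice the judge would be a secondary model call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# A check returns None when the text passes, or a short reason when it fails.
Check = Callable[[str], Optional[str]]


@dataclass
class Result:
    ok: bool
    text: str = ""
    reason: str = ""


def no_injection(text: str) -> Optional[str]:
    # Hypothetical and deliberately naive pattern, for illustration only.
    if re.search(r"ignore (all )?previous instructions", text, re.IGNORECASE):
        return "possible prompt injection"
    return None


def no_email(text: str) -> Optional[str]:
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text):
        return "possible email address (PII)"
    return None


def first_failure(checks: list[Check], text: str) -> Optional[str]:
    """Run checks in order and stop at the first one that fails."""
    for check in checks:
        reason = check(text)
        if reason is not None:
            return reason
    return None


def guarded_call(
    llm: Callable[[str], str],
    prompt: str,
    pre: list[Check],
    fast_post: list[Check],
    slow_post: list[Check],
) -> Result:
    # Input guardrails run before the model is ever called.
    reason = first_failure(pre, prompt)
    if reason is not None:
        return Result(False, reason=f"input blocked: {reason}")

    output = llm(prompt)

    # Fast deterministic checks first; the slow checks (for example an
    # LLM-based judge) run only on outputs that pass them.
    reason = first_failure(fast_post, output)
    if reason is None:
        reason = first_failure(slow_post, output)
    if reason is not None:
        return Result(False, reason=f"output blocked: {reason}")
    return Result(True, text=output)


# Stubs standing in for a real model and a real LLM-based judge.
judge_calls: list[str] = []


def stub_llm(prompt: str) -> str:
    return "Contact alice@example.com" if "contact" in prompt else "Paris"


def stub_judge(text: str) -> Optional[str]:
    judge_calls.append(text)  # record that the slow path was reached
    return None
```

Calling `guarded_call(stub_llm, "capital of France?", [no_injection], [no_email], [stub_judge])` reaches the judge, while a prompt whose output contains an email address is rejected by the regex check before the judge is invoked. That keeps the expensive check off the path for outputs that the cheap checks already reject.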
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:10:51.391311+00:00— report_created — created