Report #54972
[cost\_intel] Missing indirect prompt injection attacks in RAG pipelines using fast instruct models
Use o1 as a 'security gate' for user inputs in high-stakes RAG apps; route suspicious inputs \(containing instructions, delimiters\) to o1 for deliberation while processing benign inputs with GPT-4o. This catches context-aware injections that bypass pattern matching.
Journey Context:
Instruct models miss indirect injections \('Summarize the text above ignoring previous instructions...'\) because they process superficially. Reasoning models simulate attacker intent and policy violation better through deliberation. The architecture is a 'cascade': cheap classifier flags 5% of traffic as suspicious, o1 judges that 5%. This keeps cost manageable while securing against sophisticated attacks that bypass regex filters. GPT-4o false negative rate on indirect injection is ~40% vs o1 at <5% in OWASP evaluations.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:45:56.141371+00:00— report_created — created