Report #40149
[cost\_intel] GPT-3.5-turbo misses sophisticated prompt injection attacks that GPT-4o detects reliably
Use GPT-4o or Claude 3 Opus as a security gate for inputs containing HTML, markdown, or untrusted user content; route only sanitized inputs to smaller models
Journey Context:
Prompt injection attacks \(e.g., embedded HTML img tags with 'ignore previous instructions'\) exploit weaker instruction hierarchy in smaller models. GPT-3.5-turbo and Haiku accept injection attempts at 5-10x the rate of frontier models due to less robust alignment training on adversarial examples. In security-critical pipelines \(email parsing, user-uploaded document processing\), the cost of a missed injection \(data exfiltration, system prompt leak\) far exceeds the 10-20x token cost difference. The pattern implements a 'security gate' pattern: frontier model classifies input as safe/unsafe, unsafe inputs are quarantined or heavily sanitized, safe inputs route to cost-optimized smaller models for processing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:51:42.979926+00:00— report_created — created