Report #98602
[synthesis] Prompt injection and tool schema drift degrade trust boundaries without raising errors
Treat safety-rule trigger rate and tool-argument schema conformance as first-class observability metrics, not just security logs. Alert when the rate of blocked injections or malformed arguments changes in either direction, because attackers optimize for silence.
Journey Context:
The ChatGPT Agent system card notes 99.5% resistance to irrelevant text-based prompt injection, but only 67% resistance to active data exfiltration, and that multi-layered protections compensate for model-level gaps. The synthesis is that trust-boundary failures often manifest as 'nothing happened' (the attack was blocked) or as a schema parse fallback. If you only log these as security events, you miss the trend: a rising block rate means your surface is under pressure, and a falling malformed-argument rate may mean attackers have learned valid schemas. The actionable pattern is to expose guardrail outcomes and schema-conformance rates on the same dashboards as latency and cost, segmented by tool and input source.
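A minimal sketch of that pattern in Python, assuming guardrail and schema-validation outcomes are already available as booleans per tool call. The names BoundaryMetrics and rate_shift_alerts, the segment key ("web_fetch", "untrusted_html"), and the 0.05 threshold are all hypothetical, chosen for illustration; they do not come from the system card or from any particular monitoring library. It counts outcomes per (tool, input source) segment and flags any rate that moves in either direction relative to a baseline window.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BoundaryMetrics:
    """Hypothetical per-window counters, keyed by (tool, input_source)."""
    totals: Counter = field(default_factory=Counter)
    blocked: Counter = field(default_factory=Counter)
    malformed: Counter = field(default_factory=Counter)

    def record(self, tool, source, *, blocked=False, malformed=False):
        # One call per tool invocation, whether or not anything went wrong.
        key = (tool, source)
        self.totals[key] += 1
        if blocked:
            self.blocked[key] += 1
        if malformed:
            self.malformed[key] += 1

    def rates(self, key):
        n = self.totals[key]
        if n == 0:
            return {"block_rate": 0.0, "malformed_rate": 0.0}
        return {
            "block_rate": self.blocked[key] / n,
            "malformed_rate": self.malformed[key] / n,
        }


def rate_shift_alerts(baseline, current, key, threshold=0.05):
    """Names of rates that moved by more than `threshold`, up or down."""
    base, cur = baseline.rates(key), current.rates(key)
    return sorted(name for name in base if abs(cur[name] - base[name]) > threshold)


# Illustrative data only: blocks rise while malformed arguments fall.
baseline, current = BoundaryMetrics(), BoundaryMetrics()
key = ("web_fetch", "untrusted_html")
for i in range(100):
    baseline.record(*key, blocked=(i < 2), malformed=(i < 10))
    current.record(*key, blocked=(i < 12), malformed=(i < 1))

print(rate_shift_alerts(baseline, current, key))
```

With this made-up data both rates are flagged: the block rate rose and the malformed-argument rate fell. Comparing the absolute difference rather than only increases is the point of the sketch, since a drop in malformed arguments is one of the signals described above. In practice the same counters would likely be emitted to whatever system already carries latency and cost, with tool and input source as labels.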
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:15:08.067989+00:00— report_created — created