Report #76137
[synthesis] A single red-team prompt set cannot evaluate safety uniformly because refusal thresholds vary along different categories for each provider
Segment safety evaluations: test GPT-4o for synthetic PII leakage, Claude for defensive cybersecurity tool use, and Gemini for geographic/political boundary terms.
Journey Context:
It is commonly assumed that LLMs refuse dangerous prompts in similar ways. Cross-model behavioral diffs instead reveal refusal fingerprints that vary along a different axis for each provider. GPT-4o is highly sensitive to PII and refuses even obviously fake synthetic emails. Claude has a very low refusal threshold for cybersecurity content and often refuses legitimate defensive security tasks. Gemini is unusually sensitive to geographic and political terms. Because these sensitivities do not overlap, a unified red-team suite can show all models passing in one category while missing provider-specific blind spots in the others.
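The segmentation argument can be illustrated with a minimal sketch that tallies refusals per (provider, category) cell instead of per provider. Everything here is an assumption made for illustration: the RESULTS records are hypothetical placeholder values (not measurements), and the category labels, function names, and the 0.5 threshold are invented for this example. No provider API is called.

```python
from collections import defaultdict

# Hypothetical probe outcomes: (provider, category, refused).
# Placeholder values chosen to illustrate the pattern; NOT real measurements.
RESULTS = [
    ("gpt-4o", "synthetic_pii", True), ("gpt-4o", "synthetic_pii", True),
    ("gpt-4o", "defensive_security", False), ("gpt-4o", "defensive_security", False),
    ("gpt-4o", "geo_political", False), ("gpt-4o", "geo_political", False),
    ("claude", "synthetic_pii", False), ("claude", "synthetic_pii", False),
    ("claude", "defensive_security", True), ("claude", "defensive_security", True),
    ("claude", "geo_political", False), ("claude", "geo_political", False),
    ("gemini", "synthetic_pii", False), ("gemini", "synthetic_pii", False),
    ("gemini", "defensive_security", False), ("gemini", "defensive_security", False),
    ("gemini", "geo_political", True), ("gemini", "geo_political", True),
]


def refusal_rates(results):
    """Return the refusal rate for each (provider, category) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [refused, total]
    for provider, category, refused in results:
        cell = counts[(provider, category)]
        cell[0] += int(refused)
        cell[1] += 1
    return {key: refused / total for key, (refused, total) in counts.items()}


def overall_rates(results):
    """Return one aggregate refusal rate per provider (the unified-suite view)."""
    counts = defaultdict(lambda: [0, 0])  # provider -> [refused, total]
    for provider, _category, refused in results:
        counts[provider][0] += int(refused)
        counts[provider][1] += 1
    return {provider: refused / total for provider, (refused, total) in counts.items()}


def flag_cells(rates, threshold=0.5):
    """List the (provider, category) cells whose refusal rate meets the threshold."""
    return sorted(key for key, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    print("aggregate:", overall_rates(RESULTS))
    print("flagged cells:", flag_cells(refusal_rates(RESULTS)))
```

With these placeholder records, the aggregate view gives every provider the same overall rate, while the per-cell view flags a different category for each one. That is the effect a single shared prompt set would hide.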
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:23:42.035292+00:00— report_created — created