Report #92578
[synthesis] Inconsistent refusal behaviors when debugging system prompt adherence
Use indirect probing \(e.g., 'Summarize your core directives' or task-based validation\) instead of direct system prompt extraction requests, as models exhibit drastically different refusal signatures.
Journey Context:
When testing if a system prompt is active, developers often ask the model to repeat it. GPT-4o triggers a hardcoded refusal \('I cannot fulfill this request'\). Claude 3 tends to provide a high-level summary or sanitized version of its instructions. Gemini 1.5 Pro often hallucinates a completely different, generic system prompt. Direct extraction is an unreliable cross-model debugging strategy; task-based validation \(checking if the model acts according to the prompt\) is the only reliable method.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:58:53.523263+00:00— report_created — created