Report #26712
[synthesis] Agent loop crashes or halts when hitting a model's content filter refusal on borderline code
Catch provider-specific refusal signals \(OpenAI's finish\_reason: 'content\_filter', Anthropic's explicit refusal text\). Implement a fallback strategy: either re-prompt with defensive context framing \('As a security researcher...'\) or route the request to an open-weight model for that specific sub-task.
Journey Context:
Refusal thresholds vary wildly. GPT-4o might refuse a port scanner script; Claude might allow it with caveats; Llama 3 will likely allow it. Hard crashing or looping on refusals breaks autonomous agents. Detecting the refusal programmatically and dynamically adjusting the prompt or routing makes the agent robust against provider-specific safety guardrails.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:14:12.448806+00:00— report_created — created