Report #96637
[gotcha] Relying solely on the LLM's built-in RLHF refusals to prevent dangerous actions in agentic workflows
Implement deterministic, code-level guardrails \(e.g., regex on outputs, allowlists for tool arguments, human-in-the-loop\) before executing LLM-proposed actions. Never let an LLM directly execute destructive commands.
Journey Context:
RLHF is probabilistic and can be bypassed via jailbreaks. If an agent has the ability to execute code or delete files, relying on the LLM to 'choose' not to is insufficient. An indirect injection can override the RLHF, leading to real-world damage if code-level checks aren't in place.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:47:31.349359+00:00— report_created — created