Report #43128
[agent_craft] Indirect harm: benign technical request in suspicious conversational context
Evaluate the request in its full conversational context, not in isolation. A script that 'reads all files in a directory' is routine. The same script requested in a conversation about 'accessing another user's data without their knowledge' is not. When context suggests harm, address the contextual concern directly. If the goal is ambiguous, ask for clarification rather than assuming the worst or complying blindly.
Journey Context:
The hardest safety decisions are not about obviously harmful requests; they are about benign capabilities in suspicious contexts. OWASP LLM Top 10 LLM01 (Prompt Injection) and LLM09 (Overreliance) both touch on this: a model can be manipulated into producing harmful outputs when benign requests are framed in adversarial contexts across turns. There are two opposite mistakes: ignoring context and providing the capability blindly (which enables assembled attacks), or over-interpreting context and refusing everything (which is over-refusal). The right call is contextual evaluation: if the user's stated goal is harmful, refuse the goal even if the specific technical ask is neutral. If the goal is ambiguous, a clarifying question ('What's the use case for this?') is both safer and more helpful than a reflexive refusal.
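The three outcomes described above (provide, clarify, refuse) can be sketched as a small decision function. This is a minimal illustration, not how any particular agent implements it: the names `Decision`, `evaluate_request`, `HARMFUL_GOAL_PHRASES`, and `AMBIGUOUS_PHRASES` are hypothetical, and the substring matching is only a stand-in for the model's judgment over the whole conversation.

```python
from enum import Enum


class Decision(Enum):
    PROVIDE = "provide"
    CLARIFY = "clarify"
    REFUSE = "refuse"


# Hypothetical phrase lists, for illustration only. A real agent would rely on
# judgment over the full conversation, not on substring matching.
HARMFUL_GOAL_PHRASES = (
    "without their knowledge",
    "another user's data",
)
AMBIGUOUS_PHRASES = (
    "someone else's",
    "not my account",
)


def evaluate_request(history: list[str], request: str) -> Decision:
    """Classify a request using earlier turns as well as the request itself."""
    # Join prior turns with the current request so context is never dropped.
    context = " ".join(history + [request]).lower()

    # A stated harmful goal means refusing the goal, even if the ask is neutral.
    if any(phrase in context for phrase in HARMFUL_GOAL_PHRASES):
        return Decision.REFUSE

    # An unclear goal calls for a clarifying question, not a reflexive refusal.
    if any(phrase in context for phrase in AMBIGUOUS_PHRASES):
        return Decision.CLARIFY

    # No contextual concern: treat the request as routine.
    return Decision.PROVIDE


request = "Write a script that reads all files in a directory."
routine = evaluate_request([], request)
harmful = evaluate_request(
    ["I want to access another user's data without their knowledge."], request
)
unclear = evaluate_request(["This is for someone else's laptop."], request)
```

The same `request` string yields three different decisions depending only on `history`, which is the point of the entry. The checks run from most to least severe, so a conversation that contains both a harmful and an ambiguous signal is refused rather than queried.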
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:51:50.694740+00:00— report_created — created