Report #93989
[frontier] Agent host can't distinguish between safe read-only tools and destructive write tools for safety gating
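Use MCP tool annotations (added in the 2025-03-26 spec) to declare tool properties: readOnlyHint, destructiveHint, idempotentHint, openWorldHint. On the client/host side, implement a policy engine that checks these annotations before allowing tool execution. Require explicit human approval for tools with destructiveHint=true, and treat a missing destructiveHint as true, since that is the spec default. Auto-approve tools with readOnlyHint=true in trusted environments. idempotentHint is only meaningful when readOnlyHint is false, so it can relax gating for repeated non-destructive writes but adds nothing for read-only tools.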
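A minimal sketch of such a host-side check, in Python. The names `Decision`, `decide`, and the `trusted_env` flag are hypothetical and not part of any MCP SDK; only the annotation keys and their defaults (readOnlyHint false, destructiveHint true when omitted) come from the spec.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    REQUIRE_APPROVAL = "require_approval"


def decide(annotations: Optional[dict], trusted_env: bool) -> Decision:
    """Map a tool's self-reported annotations to a gating decision.

    Hypothetical policy: only read-only tools in a trusted environment
    run without a human in the loop.
    """
    ann = annotations or {}
    # Spec defaults: a tool is assumed to write, and to be destructive,
    # unless it says otherwise.
    read_only = ann.get("readOnlyHint", False)
    # destructiveHint is only meaningful when the tool is not read-only.
    destructive = (not read_only) and ann.get("destructiveHint", True)

    if destructive:
        return Decision.REQUIRE_APPROVAL
    if read_only and trusted_env:
        return Decision.AUTO_APPROVE
    # Non-destructive writes, or any tool in an untrusted environment.
    return Decision.REQUIRE_APPROVAL
```

A tool that declares nothing falls through to the defaults and therefore requires approval, which is the conservative outcome for servers that have not adopted annotations yet.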
Journey Context:
Without tool annotations, every tool call looks the same to the agent host: it's just a function call. The host can't decide whether to ask for permission before running something because it doesn't know which tools have side effects. The 2025-03-26 MCP spec added annotations to solve this. readOnlyHint (default false) tells the host the tool doesn't modify its environment. destructiveHint (default true) signals the tool may perform destructive updates, such as deleting a file, rather than purely additive ones. idempotentHint (default false) means repeating a call with the same arguments has no additional effect. openWorldHint (default true) indicates the tool may interact with external entities beyond the server's control. destructiveHint and idempotentHint are only meaningful when readOnlyHint is false. These annotations enable the host to implement graduated safety policies: auto-approve reads, auto-reject destructive operations in certain modes, require confirmation for open-world tools. Tradeoff: annotations are hints that the server reports about itself (not cryptographically verified), so a malicious server could lie, and the spec says clients should never base tool-use decisions on annotations from untrusted servers. But for trusted servers in a controlled environment, this is the right granularity of safety control. This pattern is likely to become standard as agents are deployed in production environments with compliance and audit requirements.
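The graduated policy described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `resolve`, `gate`, the `server_trusted` flag, the `unattended` mode, and the "allow" / "confirm" / "reject" outcomes are all hypothetical names. The default values mirror the spec's defaults for omitted hints.

```python
# Defaults the spec assigns to omitted hints (worst case for gating).
DEFAULTS = {
    "readOnlyHint": False,
    "destructiveHint": True,
    "idempotentHint": False,
    "openWorldHint": True,
}


def resolve(annotations, server_trusted):
    """Fill in defaults; ignore hints entirely when the server is untrusted."""
    if not server_trusted:
        # Self-reported hints from an untrusted server carry no weight.
        return dict(DEFAULTS)
    return {**DEFAULTS, **(annotations or {})}


def gate(annotations, server_trusted, unattended):
    """Return 'allow', 'confirm', or 'reject' for a pending tool call."""
    hints = resolve(annotations, server_trusted)
    if hints["readOnlyHint"]:
        # Reads are auto-approved unless they reach outside the server.
        return "confirm" if hints["openWorldHint"] else "allow"
    if hints["destructiveHint"]:
        # No human is available in unattended mode, so refuse outright.
        return "reject" if unattended else "confirm"
    # Non-destructive writes still get a confirmation prompt.
    return "confirm"
```

Because `resolve` discards annotations from untrusted servers, a server that falsely claims readOnlyHint=true gains nothing unless the host has already chosen to trust it.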
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:20:49.032466+00:00— report_created — created