Report #841
[architecture] How do I control whether OpenAI, Anthropic, Google, and other AI crawlers can train on or retrieve my content?
Serve an RFC 9309-compliant robots.txt at `/robots.txt` with explicit `User-agent` groups for `GPTBot`, `ClaudeBot`, `OAI-SearchBot`, `Google-Extended`, `Applebot-Extended`, `CCBot`, and any other documented AI crawlers you care about. Use `Disallow` to block training crawlers and `Allow` to admit search-retrieval crawlers. Include a `Sitemap:` line, verify the served file with `curl`, and remember that compliance is voluntary.
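As a rough illustration of the approach above, the following Python sketch builds a robots.txt body and checks it with the standard library's `urllib.robotparser` before deployment. The split of tokens into "blocked" and "allowed", the `example.com` sitemap URL, and the helper names `build_robots_txt` and `is_allowed` are assumptions for the example, not part of the original report; check each vendor's documentation before deciding which tokens to block.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy for illustration only: confirm each token's purpose
# against the vendor's own documentation before relying on this split.
BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"]
ALLOWED_AGENTS = ["OAI-SearchBot"]
SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder domain


def build_robots_txt(blocked, allowed, sitemap_url):
    """Return robots.txt text with one group per user agent."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url):
    """Parse the robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(BLOCKED_AGENTS, ALLOWED_AGENTS, SITEMAP_URL)

if __name__ == "__main__":
    print(robots_txt)
    for agent in BLOCKED_AGENTS + ALLOWED_AGENTS:
        print(agent, is_allowed(robots_txt, agent, "https://example.com/post"))
```

Agents with no matching group (and no `User-agent: *` group) are treated as allowed by the parser, so any crawler missing from the list is not blocked by this file. This check only confirms how a compliant parser would read the rules; it says nothing about whether a given crawler actually obeys them.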
Journey Context:
robots.txt is the de facto standard for crawler access control, and major AI vendors publish their crawlers' user agents and IP ranges. Without targeted rules you either block all bots (losing AI search citations) or allow all (risking unwanted training use). The protocol is voluntary, so pair it with rate limiting and access-log monitoring. Tradeoff: you must maintain a living list of crawler user agents, but it is the only standardized, server-side signal most AI crawlers honor.
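Because compliance is voluntary, the access-log monitoring mentioned above might look something like this minimal Python sketch. It assumes logs in the common "combined" format, where the User-Agent is the last quoted field; the token list, the helper name `count_ai_crawler_hits`, and the sample log lines (documentation IP ranges, simplified user-agent strings) are all made up for illustration.

```python
import re
from collections import Counter

# Tokens to look for in the User-Agent header. Google-Extended and
# Applebot-Extended are left out on the assumption that they are
# robots.txt control tokens rather than strings sent in requests.
AI_AGENT_TOKENS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "CCBot"]

# In the combined log format the User-Agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines, tokens=AI_AGENT_TOKENS):
    """Count requests per AI crawler token found in the User-Agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line does not end with a quoted field; skip it
        user_agent = match.group(1).lower()
        for token in tokens:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Fabricated sample lines for demonstration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.10 - - [01/Jan/2026:00:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2026:00:00:03 +0000] "GET /a HTTP/1.1" 200 512 "-" "CCBot/2.0"',
    '203.0.113.5 - - [01/Jan/2026:00:00:04 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

sample_hits = count_ai_crawler_hits(SAMPLE_LOG)

if __name__ == "__main__":
    for token, count in sample_hits.most_common():
        print(token, count)
```

User-Agent strings are trivially spoofed, so counts like these are a starting point rather than proof of identity; cross-checking source addresses against the IP ranges vendors publish is the stronger signal.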
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:56:42.624465+00:00— report_created — created