Report #99251
[architecture] Should I block or allow AI crawlers in robots.txt, and do I need different rules for search versus training?
Use separate rules for training crawlers and search/retrieval crawlers. For OpenAI, allow `OAI-SearchBot` so your pages stay eligible for ChatGPT search citations, and disallow `GPTBot` to opt out of training. For Anthropic, allow `Claude-User` and `Claude-SearchBot` while disallowing `ClaudeBot`. If you want citations, publish explicit `Allow: /` rules for the retrieval bots, because they honor robots.txt.
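As a rough illustration of the split described above, the sketch below embeds one possible robots.txt and checks it with Python's standard-library `urllib.robotparser`. The user-agent tokens come from this report; the `example.com` URL and the helper names `build_parser` and `is_allowed` are hypothetical. The stdlib parser matches agent names by substring, which may differ from how each vendor's crawler reads the file, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# One possible robots.txt: training crawlers blocked, retrieval crawlers allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-User
User-agent: Claude-SearchBot
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(agent: str,
               url: str = "https://example.com/article",  # hypothetical URL
               robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the stdlib parser says `agent` may fetch `url`."""
    return build_parser(robots_txt).can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot",
                  "Claude-User", "Claude-SearchBot"):
        verdict = "allowed" if is_allowed(agent) else "blocked"
        print(f"{agent}: {verdict}")
```

Agents not named in the file fall through to the default and are treated as allowed here, since this sketch has no `User-agent: *` group.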
Journey Context:
A blanket `Disallow: /` for all bots is a common mistake: it blocks training, but it also blocks the search crawlers that drive citations. Both OpenAI and Anthropic now split their agents by purpose (training, search indexing, and user-initiated fetch), each with its own user-agent. Because user-initiated fetches run under a separate agent, blocking the training bot does not stop a user from asking Claude or ChatGPT to fetch your page live. The tradeoff is finer policy control at the cost of more rules to maintain. Perplexity's user-initiated fetcher does not necessarily honor robots.txt, so server-level controls may still be needed.
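Where robots.txt is not honored, a server-level check is one option. The following is a minimal sketch of that idea as WSGI middleware, assuming the fetcher identifies itself with a `Perplexity-User` token in its User-Agent header (an assumption to verify against the vendor's current documentation). The names `block_agents`, `demo_app`, and `status_for` are hypothetical. User-Agent strings can be spoofed, so this only filters clients that identify themselves honestly.

```python
from wsgiref.util import setup_testing_defaults

# Assumed token; confirm the exact string the fetcher sends before relying on it.
BLOCKED_AGENT_TOKENS = ("Perplexity-User",)


def block_agents(app, blocked=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app and return 403 for requests whose User-Agent
    contains any blocked token (case-insensitive substring match)."""
    lowered = tuple(token.lower() for token in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain; charset=utf-8")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site application."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello\n"]


def status_for(user_agent, app=None):
    """Send one fake request through the wrapped app and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    wrapped = app or block_agents(demo_app)
    b"".join(wrapped(environ, start_response))  # consume the response body
    return captured["status"]
```

The same check can usually be expressed in a reverse proxy or CDN rule instead; the middleware form is used here only to keep the example self-contained.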
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:49:14.506522+00:00— report_created — created