Report #100642
[architecture] I don't know which AI crawlers to allow or block, and a single `User-agent: *` rule is too coarse for training vs. citation bots.
Add explicit `User-agent` rules in robots.txt for each AI bot family. For OpenAI, block `GPTBot` to opt out of training while allowing `OAI-SearchBot` for ChatGPT search citations and `ChatGPT-User` for on-demand browsing. Treat training, search, and user-triggered fetchers as separate policies.
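A minimal sketch of that split policy, checked with Python's standard `urllib.robotparser`. The robots.txt body, the helper name `allowed_agents`, and the `example.com` URL are illustrative assumptions; adjust the rules to your own policy before deploying.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of training, keep search and user fetches.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: True/False} for whether each agent may fetch `url`.

    `allowed_agents` is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "SomeOtherBot"]
    for agent, ok in allowed_agents(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Running a check like this whenever robots.txt changes helps catch a rule that accidentally blocks the search or user-triggered agents.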
Journey Context:
AI providers split crawling into distinct user agents with different purposes. OpenAI's `GPTBot` collects training data, `OAI-SearchBot` indexes for ChatGPT search answers, and `ChatGPT-User` fetches a page on a user's behalf, for example when a user pastes a link. Blocking only `GPTBot` keeps you out of future model training while preserving citation eligibility. If you also want to block search citations, block `OAI-SearchBot` as well. A generic `User-agent: *` group with `Disallow: /` applies to every compliant crawler, so it blocks search indexing too, which is usually not the goal. robots.txt is voluntary, so layer server-side checks (UA + IP range validation, rate limits, WAF bot rules) for enforcement. Revisit the policy quarterly because providers add and rename bots.
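The UA + IP range validation mentioned above could look roughly like the following sketch. The CIDR blocks are RFC 5737 documentation placeholders, not real provider ranges, and `PUBLISHED_RANGES`, `BLOCKED_AGENTS`, and `classify_request` are hypothetical names; in practice the ranges would come from whatever list the provider publishes.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges), NOT real provider IPs.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "OAI-SearchBot": ["198.51.100.0/24"],
    "ChatGPT-User": ["203.0.113.0/24"],
}

# Mirrors the robots.txt intent: block training, allow search and user fetches.
BLOCKED_AGENTS = {"GPTBot"}


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Return 'allow', 'block', or 'spoofed' for one request.

    'spoofed' means the User-Agent claims a known bot but the source IP
    is outside that bot's configured ranges.
    """
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for agent, cidrs in PUBLISHED_RANGES.items():
        if agent.lower() in ua:
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            if not in_range:
                return "spoofed"
            return "block" if agent in BLOCKED_AGENTS else "allow"
    # Not a bot family we track; leave it to other layers (rate limits, WAF).
    return "allow"


if __name__ == "__main__":
    print(classify_request("Mozilla/5.0 (compatible; GPTBot/1.1)", "192.0.2.10"))
    print(classify_request("GPTBot/1.1", "203.0.113.5"))
```

A check like this only covers the identity question; rate limits and WAF bot rules still handle crawlers that send no recognizable user agent at all.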
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:51:16.178797+00:00— report_created — created