Report #5282
[architecture] Which AI crawlers should I allow or block in robots.txt?
Separate training crawlers from search and citation crawlers. Block the training bots you do not want feeding models: GPTBot, Google-Extended, CCBot, Applebot-Extended, Meta-ExternalAgent, ClaudeBot. Allow the search and live-fetch bots if you want citations: OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, Claude-User. Never block Googlebot or Bingbot unless you intend to disappear from search.
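A minimal sketch of that split, written in Python so the resulting rules can be checked with the standard library's `urllib.robotparser`. The helper names (`build_robots_txt`, `is_allowed`) and the `example.com` URL are hypothetical. ClaudeBot is grouped with the training crawlers here because the Journey Context describes it as Anthropic's training crawler; adjust both lists to your own policy.

```python
from urllib.robotparser import RobotFileParser

# Assumed grouping based on this report; verify against each vendor's docs.
TRAINING_BOTS = ["GPTBot", "Google-Extended", "CCBot",
                 "Applebot-Extended", "Meta-ExternalAgent", "ClaudeBot"]
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
               "Perplexity-User", "Claude-SearchBot", "Claude-User"]


def build_robots_txt(blocked, allowed):
    """Emit one robots.txt group per user agent."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    return "\n".join(lines)


def is_allowed(robots_txt, agent, url="https://example.com/"):
    """Return True if the given agent may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ROBOTS_TXT = build_robots_txt(TRAINING_BOTS, SEARCH_BOTS)

if __name__ == "__main__":
    print(ROBOTS_TXT)
```

Googlebot and Bingbot are deliberately absent from both lists: with no matching group and no `User-agent: *` group, they remain allowed by default.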
Journey Context:
OpenAI separated GPTBot (training) from OAI-SearchBot (ChatGPT Search citations) and ChatGPT-User (live browsing). Google-Extended controls Gemini and Vertex AI training and does not affect Google Search. Anthropic splits ClaudeBot (training), Claude-SearchBot (search index), and Claude-User (user-initiated fetches). The most expensive error is blocking OAI-SearchBot or Claude-SearchBot while allowing GPTBot, which removes you from AI answers while still giving away training data. robots.txt is voluntary, but well-behaved crawlers honor it; layer server-side blocks for bad actors.
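The inverted configuration described above (search crawler blocked, training crawler allowed) can be caught with a small check before deploying a robots.txt. This is a sketch only: `find_inverted_rules` and `VENDOR_PAIRS` are hypothetical names, the pairs cover just the OpenAI and Anthropic crawlers named in this report, and the check relies on `urllib.robotparser`, whose matching may differ in edge cases from what each crawler actually implements.

```python
from urllib.robotparser import RobotFileParser

# (training crawler, search crawler) pairs taken from this report.
VENDOR_PAIRS = [
    ("GPTBot", "OAI-SearchBot"),
    ("ClaudeBot", "Claude-SearchBot"),
]


def find_inverted_rules(robots_txt, url="https://example.com/"):
    """Return pairs where the training bot is allowed but the search bot is blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for training, search in VENDOR_PAIRS:
        training_ok = parser.can_fetch(training, url)
        search_ok = parser.can_fetch(search, url)
        if training_ok and not search_ok:
            problems.append((training, search))
    return problems
```

An empty list means no inverted pair was found for the crawlers checked; it says nothing about crawlers that ignore robots.txt, which still need server-side blocks.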
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:57:41.952286+00:00— report_created — created