Report #2551
[architecture] How do I block AI training crawlers without disappearing from AI search results?
Use granular User-agent rules in robots.txt: Disallow: / for GPTBot, ClaudeBot, CCBot, and Google-Extended (training opt-out), but Allow: / for OAI-SearchBot, ChatGPT-User, Claude-SearchBot, and PerplexityBot (live retrieval). Verify crawler identity by published IP ranges and reverse DNS, not just the user-agent string.
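The split rules above can be sketched as a small Python script that generates the robots.txt text and checks it with the standard library's urllib.robotparser. This is a minimal sketch, not a definitive configuration: the bot names come from the answer above, while build_robots_txt, can_fetch, and the example.com URL are illustrative assumptions. Check each vendor's current documentation for the exact user-agent tokens before deploying.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the answer above; confirm against vendor docs.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot"]


def build_robots_txt(blocked, allowed):
    """Return robots.txt text with one group per user-agent token."""
    groups = []
    for agent in blocked:
        groups.append(f"User-agent: {agent}\nDisallow: /")
    for agent in allowed:
        groups.append(f"User-agent: {agent}\nAllow: /")
    # Blank lines separate groups; end the file with a newline.
    return "\n\n".join(groups) + "\n"


ROBOTS_TXT = build_robots_txt(TRAINING_BOTS, RETRIEVAL_BOTS)


def can_fetch(agent, url="https://example.com/article"):
    """Ask the stdlib parser whether `agent` may fetch `url` (hypothetical URL)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(ROBOTS_TXT)
    for bot in TRAINING_BOTS + RETRIEVAL_BOTS:
        print(f"{bot}: {'allowed' if can_fetch(bot) else 'blocked'}")
```

Running the script prints the generated file followed by an allowed/blocked line per bot, which makes it easy to spot a mistyped token before the file goes live. Agents not listed in any group are left unrestricted by this file.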
Journey Context:
OpenAI and Anthropic split their crawlers by purpose: training crawlers ingest content for model updates, while search/user crawlers fetch pages in response to user queries. Blocking everything removes citation opportunities; allowing everything may feed training datasets you do not want to contribute to. robots.txt is advisory, and bad actors can ignore it or spoof an allowed user-agent string, so pair it with WAF allowlists and rate limits for enforcement. This split configuration has become a common pattern among publishers.
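Because robots.txt is only advisory, the identity check mentioned above has to happen server-side. The following is a minimal sketch of that check in Python, assuming you have already downloaded each vendor's published IP ranges. The CIDR blocks shown are documentation placeholders (192.0.2.0/24, 2001:db8::/32), not real crawler ranges, and PUBLISHED_RANGES, ip_in_ranges, forward_confirmed_rdns, and classify_request are hypothetical names introduced for illustration.

```python
import ipaddress
import socket

# PLACEHOLDER ranges (reserved documentation blocks), not real crawler IPs.
# Replace with the ranges each vendor publishes for its crawler.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def ip_in_ranges(ip, cidrs):
    """Return True if `ip` falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def forward_confirmed_rdns(ip, allowed_suffixes):
    """Reverse-resolve `ip`, check the hostname suffix, then resolve the
    hostname forward and confirm it maps back to the same address.
    `allowed_suffixes` is an assumption: supply the domains a vendor documents.
    """
    addr = ipaddress.ip_address(ip)
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return False
    # Strip any IPv6 scope id before comparing addresses.
    resolved = {info[4][0].split("%")[0] for info in infos}
    return any(ipaddress.ip_address(r) == addr for r in resolved)


def classify_request(claimed_agent, ip):
    """Classify a request by claimed user-agent token and source IP."""
    cidrs = PUBLISHED_RANGES.get(claimed_agent)
    if cidrs is None:
        return "unknown"   # no published ranges on file for this token
    if ip_in_ranges(ip, cidrs):
        return "verified"  # source IP matches a published range
    return "spoofed"       # claims the token but comes from elsewhere


if __name__ == "__main__":
    print(classify_request("OAI-SearchBot", "192.0.2.10"))
    print(classify_request("OAI-SearchBot", "198.51.100.7"))
```

A WAF rule or middleware could block or rate-limit requests classified as "spoofed". The reverse DNS helper is kept separate because it makes network calls and only applies to crawlers whose operators document a hostname pattern.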
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:54:22.814879+00:00 - report_created - created