Report #334
[architecture] How do I control which AI training crawlers can scrape my site?
Use robots.txt with the exact user-agent tokens each vendor publishes, e.g., GPTBot for OpenAI training (OAI-SearchBot is OpenAI's separate search crawler), ClaudeBot for Anthropic (anthropic-ai is an older token), Google-Extended for Google, and CCBot for Common Crawl. Remember that robots.txt compliance is voluntary; pair it with clear Terms of Service if you want a basis for legal enforcement.
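A minimal sketch of what such a robots.txt could look like, generated and checked locally with Python's standard urllib.robotparser. The token list, the helper names (build_robots_txt, is_allowed) and the example.com URL are illustrative assumptions; vendor tokens change, so confirm each one against the vendor's current documentation before deploying.

```python
from urllib.robotparser import RobotFileParser

# Assumption: tokens taken from this report, plus ClaudeBot as Anthropic's
# current crawler token. Verify all of them against vendor docs.
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended", "CCBot"]


def build_robots_txt(blocked_tokens):
    """Return robots.txt text that blocks the given tokens and allows the rest."""
    lines = [f"User-agent: {token}" for token in blocked_tokens]
    lines.append("Disallow: /")   # one shared rule for the whole group
    lines.append("")              # blank line ends the group
    lines.append("User-agent: *")
    lines.append("Allow: /")      # every other crawler stays allowed
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url="https://example.com/article"):
    """Ask the stdlib parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS_TXT = build_robots_txt(TRAINING_TOKENS)

if __name__ == "__main__":
    print(ROBOTS_TXT)
    for agent in ["GPTBot", "CCBot", "OAI-SearchBot", "Googlebot"]:
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent))
```

The standard-library parser only models a well-behaved crawler. It says nothing about scrapers that ignore robots.txt, so treat the check as a syntax and intent test rather than proof of enforcement.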
Journey Context:
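There is no single 'AI crawler' standard; user-agent names proliferate and some scrapers ignore robots.txt entirely. A common mistake is relying on User-agent: * alone. Under RFC 9309, a compliant crawler follows the group that names its token and falls back to the * group only when no such group exists, so a blanket Disallow under * also shuts out ordinary search crawlers, while any crawler given its own group ignores the * rules completely. Naming each vendor token explicitly gives finer control. Tradeoff: blocking training crawlers reduces data leakage but may also reduce citations and referral traffic from AI search products, which is why OpenAI now separates OAI-SearchBot from GPTBot.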
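How a wildcard group and a named group interact can be checked locally. The sketch below uses Python's urllib.robotparser as a stand-in for a standards-following crawler; that is an assumption, since individual vendors may treat their tokens differently. The helper name (allowed), the two sample files and the example.com paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, path="/private/page"):
    """Return True if the stdlib parser lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "https://example.com" + path)


# Only a wildcard group: applies to every crawler without its own group.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

# A named group replaces the wildcard group for that crawler; it does not add to it.
WITH_SPECIFIC_GROUP = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /drafts/
"""

if __name__ == "__main__":
    # The wildcard blocks everyone here, including ordinary search crawlers.
    print(allowed(WILDCARD_ONLY, "GPTBot"))      # False
    print(allowed(WILDCARD_ONLY, "Googlebot"))   # False

    # GPTBot now follows only its own group, so /private/page is open to it.
    print(allowed(WITH_SPECIFIC_GROUP, "GPTBot"))                  # True
    print(allowed(WITH_SPECIFIC_GROUP, "GPTBot", "/drafts/post"))  # False
```

In this model, giving a crawler its own group means every rule intended for it has to be repeated inside that group.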
Lifecycle
2026-06-13T04:39:51.135272+00:00— report_created — created