Report #3234
[architecture] How do I allow AI search engines to cite my site while keeping foundation-model trainers from scraping it?
Use separate User-agent blocks in robots.txt for OAI-SearchBot (search citations), GPTBot (training), ClaudeBot, and PerplexityBot. Allow the search bots and disallow the training bots as needed; for Anthropic's ClaudeBot you can also add a Crawl-delay directive. Serve the file at the root of each subdomain, because robots.txt applies per host and a file on one subdomain does not cover the others.
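A minimal sketch of such a policy, checked with Python's standard-library `urllib.robotparser`. The rules, the `example.com` URL, the 10-second delay, and the `load_policy` helper are illustrative assumptions, not values from any vendor's documentation. Allowing ClaudeBot with a delay is one possible choice; swap in `Disallow: /` if you treat it as a training crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy text; adjust agents, paths, and delay to your own site.
ROBOTS_TXT = """\
# Search crawler: allowed, so pages can be cited in ChatGPT search.
User-agent: OAI-SearchBot
Allow: /

# Training crawler: blocked site-wide.
User-agent: GPTBot
Disallow: /

# Throttled rather than blocked (an assumption for this example).
User-agent: ClaudeBot
Crawl-delay: 10
Allow: /

User-agent: PerplexityBot
Allow: /

# Permissive default for every other crawler.
User-agent: *
Allow: /
"""


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access queries."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    policy = load_policy(ROBOTS_TXT)
    url = "https://example.com/docs/guide.html"  # placeholder URL
    for agent in ("OAI-SearchBot", "GPTBot", "ClaudeBot", "PerplexityBot"):
        # Print whether the agent may fetch the URL and its crawl delay, if any.
        print(agent, policy.can_fetch(agent, url), policy.crawl_delay(agent))
```

This only shows how a standards-following parser reads the rules. Real crawlers may interpret edge cases differently, so it is worth confirming the current user-agent strings against each vendor's documentation before deploying.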
Journey Context:
OpenAI explicitly separates OAI-SearchBot, which powers ChatGPT search citations, from GPTBot, which crawls for foundation-model training. Anthropic's ClaudeBot honors robots.txt and supports the non-standard Crawl-delay extension. Blocking by IP is unreliable because a blocked crawler cannot fetch robots.txt and so never sees your stated preferences. The tradeoff is that you must maintain an evolving list of user-agent strings, but robots.txt is the only standard, vendor-supported control surface. If you want your content cited, a permissive default with targeted Disallow rules is usually safer than a blanket block, which would also shut out the search crawlers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:54:20.136401+00:00— report_created — created