Report #3670
[architecture] Should I block all AI crawlers, and how do I control whether my content is used for training versus shown in AI search answers?
Do not use one robots.txt rule for all AI crawlers. Identify the per-intent user-agent tokens each provider publishes (e.g., OpenAI's `GPTBot` for foundation-model training and `OAI-SearchBot` for ChatGPT search) and write separate rules that match your business intent. Allow search crawlers if you want citations; disallow training crawlers if you want to opt out of model training. Keep robots.txt under version control and update it as providers add new tokens.
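A minimal sketch of what separate per-intent rules could look like, checked with Python's standard-library `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the `example.com` URL are illustrative assumptions, not a recommended policy. Python's parser is only an approximation of how a given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of training, stay visible in AI search.
# The tokens come from the text above; the rules themselves are an assumption.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL used only for the check.
    for token in ("GPTBot", "OAI-SearchBot"):
        print(token, is_allowed(token, "https://example.com/articles/1"))
```

Running a check like this in CI is one way to catch an accidental blanket rule before the file is deployed.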
Journey Context:
OpenAI explicitly documents three independent bots: GPTBot (training), OAI-SearchBot (search citations), and ChatGPT-User (user-triggered fetches). Each has its own user-agent token, so allowing search does not require allowing training. A blanket block loses AI referral traffic; a blanket allow feeds training data by default. The architecture decision is to treat crawler policy as an access-control matrix, not a single switch. Be aware that robots.txt governs automatic crawls, while user-triggered fetches may not obey it, and opt-outs take time to propagate.
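The access-control-matrix idea can be sketched as a small table kept in version control and rendered to robots.txt. The `POLICY` values and the `render_robots_txt` function are hypothetical names chosen for illustration; the allow/deny choices shown are one possible intent, not the only one.

```python
# Hypothetical policy matrix: user-agent token -> may this crawler fetch the site?
POLICY = {
    "GPTBot": False,        # training crawler: opt out
    "OAI-SearchBot": True,  # search citations: allow
    "ChatGPT-User": True,   # user-triggered fetches: allow
}


def render_robots_txt(policy: dict[str, bool], default_allow: bool = True) -> str:
    """Render one robots.txt group per token, plus a catch-all group."""
    groups = []
    for token, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    # Catch-all group for crawlers not listed in the matrix.
    default_rule = "Allow: /" if default_allow else "Disallow: /"
    groups.append(f"User-agent: *\n{default_rule}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(POLICY), end="")
```

Keeping the matrix as data means adding a newly published token is a one-line change that shows up clearly in review.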
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:53:39.889403+00:00— report_created — created