Report #1092
[architecture] I want helpful AI crawlers to find my content but block scrapers that just steal it
Use layered access control: targeted `User-agent` directives in `robots.txt` for well-behaved bots (e.g., `GPTBot`, `OAI-SearchBot`, `ClaudeBot`, `ChatGPT-User`), published IP-range allowlists, rate limiting, and Terms of Service. Do not rely on `robots.txt` alone for enforcement; monitor logs and block bad actors at the WAF or edge layer.
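The enforcement layers named above (IP-range verification and rate limiting) can be sketched in Python. This is a minimal illustration, not a production WAF rule: the `PUBLISHED_RANGES` table uses reserved documentation CIDR blocks as stand-ins, and `classify_request` and `SlidingWindowLimiter` are hypothetical names. In practice you would load the ranges each crawler operator actually publishes.

```python
import ipaddress
from collections import defaultdict, deque

# ASSUMPTION: placeholder CIDR blocks from the reserved documentation ranges.
# Replace them with the ranges each crawler operator actually publishes.
PUBLISHED_RANGES = {
    "gptbot": ["192.0.2.0/24"],
    "oai-searchbot": ["198.51.100.0/24"],
}


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Return 'verified', 'spoofed', or 'unknown' for a request.

    'spoofed' means the user agent claims to be a known crawler but the
    source IP falls outside that crawler's listed ranges.
    """
    ua = user_agent.lower()
    ip = ipaddress.ip_address(remote_ip)
    for token, cidrs in PUBLISHED_RANGES.items():
        if token in ua:
            networks = (ipaddress.ip_network(cidr) for cidr in cidrs)
            return "verified" if any(ip in net for net in networks) else "spoofed"
    return "unknown"


class SlidingWindowLimiter:
    """Allow at most max_requests per key within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

One plausible way to combine the two: pass `verified` crawlers through, reject `spoofed` requests outright, and apply the limiter (keyed by client IP) to everything classified as `unknown`.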
Journey Context:
Major AI crawlers publish their user-agent strings and honor `robots.txt`, but malicious scrapers ignore it. Blocking every AI bot kills discoverability; allowing every bot risks content extraction. The correct architecture is segmentation: allow the search and retrieval bots you want citations from, disallow training-only bots if that aligns with your policy, and stop abuse with rate limits and IP verification. A frequent mistake is blocking `GPTBot` on the assumption that this blocks all OpenAI access, while `OAI-SearchBot` and `ChatGPT-User` remain unaddressed.
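The `GPTBot` pitfall can be checked locally with Python's standard `urllib.robotparser`. The sketch below compares a naive policy with a segmented one. Both policy texts, the `/private/` path, the example URLs, and the decision to disallow `GPTBot` are illustrative assumptions, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative policies. Which bots to allow is a site decision.
NAIVE_POLICY = """\
User-agent: GPTBot
Disallow: /
"""

SEGMENTED_POLICY = """\
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(policy_text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(policy_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Report whether the policy lets the given user agent fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/post"
    for name, text in (("naive", NAIVE_POLICY), ("segmented", SEGMENTED_POLICY)):
        parser = build_parser(text)
        for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
            print(name, agent, is_allowed(parser, agent, article))
```

Under the naive policy only `GPTBot` is refused, so `OAI-SearchBot` and `ChatGPT-User` are allowed by default rather than by decision. The segmented policy states the intent for each agent explicitly. Either way, this only describes what compliant crawlers will do; it enforces nothing against scrapers that ignore `robots.txt`.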
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:54:09.744200+00:00 - report_created - created