Report #904
[architecture] robots.txt is too blunt: I want to block AI training crawlers while allowing search crawlers, and I want to signal permitted uses of my content
Maintain robots.txt rules targeted at known AI user-agent tokens (e.g. GPTBot, ClaudeBot, PerplexityBot, Applebot-Extended) for access control, and layer purpose-based signals such as ai.txt on top for use restrictions. Treat both as voluntary signals and back them with rate limits or edge blocks for bad actors.
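A minimal sketch of the robots.txt layer, assuming the four tokens named above are the ones to block. The helper names `build_robots_txt` and `can_fetch` and the example.com URL are hypothetical. It uses Python's standard-library `urllib.robotparser` to check how a compliant crawler would read the file; it says nothing about crawlers that ignore or spoof the rules.

```python
from urllib.robotparser import RobotFileParser

# Assumed token list for illustration, copied from the examples in this report.
# A real deployment would refresh it from a curated list.
AI_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Applebot-Extended"]


def build_robots_txt(ai_tokens):
    """Return robots.txt text: one group disallowing the AI tokens, one allowing the rest."""
    lines = [f"User-agent: {token}" for token in ai_tokens]
    lines.append("Disallow: /")
    lines.append("")  # blank line closes the first group
    lines.append("User-agent: *")
    lines.append("Allow: /")
    return "\n".join(lines) + "\n"


def can_fetch(robots_txt, user_agent, url="https://example.com/article"):
    """Ask the stdlib parser whether a compliant crawler with this token may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = build_robots_txt(AI_TOKENS)
    print(robots)
    for agent in ["GPTBot", "Googlebot", "Applebot", "Applebot-Extended"]:
        print(agent, can_fetch(robots, agent))
```

Grouping several User-agent lines above a single Disallow keeps the file short. Applebot stays allowed because only the Applebot-Extended token is listed, which is the search-versus-training split this report is after.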
Journey Context:
robots.txt (RFC 9309) only answers 'can you fetch this URL?'; it has no concept of downstream use. That worked for passive search crawlers, but AI training, RAG, and search agents all fetch the same bytes for different purposes. The common mistake is assuming a single 'Disallow: /' protects content from training while keeping search visibility. It does neither well: under 'User-agent: *' it also turns away compliant search crawlers, and it stops only the crawlers that choose to obey it. Curated block lists like Dark Visitors and ai.robots.txt keep user-agent tokens current, but compliance is voluntary and spoofing is trivial. Purpose-based proposals such as ai.txt try to express 'allow search, disallow training,' yet no crawler is legally bound to obey them. The right architecture is layered: robots.txt for coarse access, ai.txt for intent signaling, and CDN/WAF rules plus rate limiting for enforcement. Accept that perfect control is impossible; the goal is to raise the cost for scrapers and give polite agents clear guidance.
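The enforcement layer can be sketched as a small edge filter, shown here in plain Python rather than in any particular CDN or WAF rule language. Everything in it is an assumption for illustration: the `EdgeFilter` and `TokenBucket` names, the deny-list contents, and the burst and refill numbers. The user-agent check catches only clients that identify themselves honestly, so the per-client token bucket applies regardless of what the User-Agent header claims.

```python
from dataclasses import dataclass, field

# Hypothetical deny list: substrings matched case-insensitively against User-Agent.
BLOCKED_UA_SUBSTRINGS = ("gptbot", "claudebot", "perplexitybot")


@dataclass
class TokenBucket:
    capacity: float          # maximum burst size, in requests
    refill_per_sec: float    # sustained requests per second
    tokens: float = field(init=False)
    updated_at: float = 0.0  # timestamp of the last refill, in seconds

    def __post_init__(self):
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if one is available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EdgeFilter:
    """Return an HTTP status per request: 403 for blocked agents, 429 when over rate."""

    def __init__(self, capacity: float = 5.0, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}  # client IP -> TokenBucket

    def decide(self, client_ip: str, user_agent: str, now: float) -> int:
        ua = user_agent.lower()
        if any(sub in ua for sub in BLOCKED_UA_SUBSTRINGS):
            return 403
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, updated_at=now)
            self.buckets[client_ip] = bucket
        return 200 if bucket.allow(now) else 429
```

In use, `decide` would be called once per request with a monotonic timestamp, for example `EdgeFilter().decide(ip, ua, time.monotonic())`. Keying the bucket on IP alone is a simplification; a production setup would likely key on more than that.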
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:56:30.338917+00:00— report_created — created