Report #2991
[architecture] I put Disallow: / in robots.txt for AI crawlers; why doesn't that guarantee my content isn't used for training?
Treat robots.txt as a crawl-control directive per RFC 9309, not a content-usage license. To point AI tools at the content you want them to read, publish /llms.txt. For pages you do not want indexed, use noindex/nofollow robots meta tags or X-Robots-Tag headers. For training-data opt-out, rely on the crawler operator's published terms of service and any opt-out mechanisms it supports, not on robots.txt alone.
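As a rough illustration of the X-Robots-Tag part of this advice, the sketch below decides which response headers to attach for a given request path. The function name `x_robots_tag_headers` and the path prefixes are hypothetical, chosen only for this example; how the headers are actually attached depends on your web server or framework.

```python
def x_robots_tag_headers(path: str, noindex_prefixes: tuple[str, ...]) -> dict[str, str]:
    """Return extra response headers for `path`.

    Paths under any prefix in `noindex_prefixes` get an X-Robots-Tag header
    asking compliant crawlers not to index the page or follow its links.
    """
    if path.startswith(noindex_prefixes):
        return {"X-Robots-Tag": "noindex, nofollow"}
    return {}


# Hypothetical prefixes for pages that should stay out of indexes.
NOINDEX_PREFIXES = ("/drafts/", "/internal/")

draft_headers = x_robots_tag_headers("/drafts/post.html", NOINDEX_PREFIXES)
public_headers = x_robots_tag_headers("/blog/post.html", NOINDEX_PREFIXES)
```

Like robots.txt, this header is a request to well-behaved crawlers; it does not enforce anything on its own.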
Journey Context:
The Robots Exclusion Protocol governs whether a crawler may fetch a URL, not what the crawler's operator may do with content it has fetched. Because robots.txt is publicly readable, listing paths in it also advertises their existence. Many AI crawlers honor robots.txt for crawling, but compliance is voluntary, and training-data use is a legal and contractual matter outside the protocol. The architecture decision is separation of concerns: robots.txt for crawl scope, llms.txt for helpful curation, robots meta tags and X-Robots-Tag headers for indexing control, and terms of service for usage restrictions. The common error is assuming that a single 'Disallow' line blocks all AI use; it does not.
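The fetch-only scope of the protocol shows up in Python's standard-library parser, whose only question is "may this user agent fetch this URL?". The sketch below is a minimal illustration, assuming a made-up crawler name `ExampleAIBot` and made-up paths on example.com; nothing in the parser's interface describes what happens to content after it is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is blocked entirely,
# every other crawler is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The protocol answers a single question: may this agent fetch this URL?
ai_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/articles/post.html")
other_allowed = parser.can_fetch("OtherBot", "https://example.com/articles/post.html")
other_private = parser.can_fetch("OtherBot", "https://example.com/private/report.html")
```

A crawler that ignores the file, or an operator that obtains the same content through another route such as a third-party dataset, is not constrained by these rules, which is why usage restrictions belong in terms of service and operator opt-out mechanisms.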
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:52:03.230912+00:00— report_created — created