Report #614
[architecture] Letting any AI crawler index everything without a robots.txt / ai.txt policy
Explicitly declare crawler policy in robots.txt using the user-agent names of known AI crawlers (e.g., ChatGPT-User, GPTBot, anthropic-ai, Claude-Web, PerplexityBot) and block staging, admin, and user-private endpoints. For finer control, consider an ai.txt manifest or rate-limiting middleware, but keep the policy in one auditable place.
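A minimal sketch of such a policy, under stated assumptions: the paths (/docs/, /admin/, /staging/, /account/) are hypothetical placeholders, and the choice to let the listed AI crawlers read only public documentation is one possible policy, not the only one. The sketch renders robots.txt from a single Python structure, so the policy lives in one reviewable place, and then checks the result with the standard library's urllib.robotparser. That parser applies the first matching rule within a group, so Allow is emitted before Disallow; crawlers that follow longest-match resolution should reach the same result for these rules.

```python
from urllib.robotparser import RobotFileParser

# Agent names come from the report; all paths are hypothetical placeholders.
AI_AGENTS = ["ChatGPT-User", "GPTBot", "anthropic-ai", "Claude-Web", "PerplexityBot"]
PRIVATE_PATHS = ["/admin/", "/staging/", "/account/"]
PUBLIC_DOCS = "/docs/"


def render_robots_txt() -> str:
    """Render one group for the listed AI crawlers and one default group."""
    lines = [f"User-agent: {agent}" for agent in AI_AGENTS]
    # Allow comes first: urllib.robotparser applies the first matching rule.
    lines += [f"Allow: {PUBLIC_DOCS}", "Disallow: /", ""]
    # Everyone else (including search engines) is only kept out of private paths.
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in PRIVATE_PATHS]
    return "\n".join(lines) + "\n"


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


policy = load_policy(render_robots_txt())
base = "https://example.com"
print(policy.can_fetch("GPTBot", base + "/docs/intro"))    # AI crawler, public docs
print(policy.can_fetch("GPTBot", base + "/admin/users"))   # AI crawler, admin area
print(policy.can_fetch("Googlebot", base + "/blog/post"))  # search engine, public page
```

Running the same checks in CI whenever the agent list or path list changes is one way to keep the published file and the intended policy from drifting apart.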
Journey Context:
With no policy in place, the default is 'allow all', which exposes admin panels, raw exports, and duplicate content that can pollute training data or leak context. A generic "User-agent: *" group with "Disallow: /" is often too blunt because it blocks search engines too. The right granularity is per-crawler user-agent blocks plus clear allow rules for public documentation. Tradeoff: robots.txt is advisory, not enforced, so pair it with auth on sensitive routes. There is no universal 'AI' user-agent, so the policy must enumerate known agents and be updated as the ecosystem changes.
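Because robots.txt is only advisory, sensitive routes need a server-side check that applies to every client regardless of user-agent. The following is a minimal sketch of that pairing as a WSGI middleware. The class name RequireAuth, the protected prefixes, and the static bearer token are all hypothetical; a real deployment would use whatever session or token scheme the application already has.

```python
import hmac

# Hypothetical prefixes; align these with the Disallow rules in robots.txt.
PROTECTED_PREFIXES = ("/admin/", "/staging/", "/account/")


class RequireAuth:
    """WSGI middleware: protected prefixes need a matching Authorization header."""

    def __init__(self, app, token: str):
        self.app = app
        self.expected = f"Bearer {token}".encode()

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path.startswith(PROTECTED_PREFIXES):
            supplied = environ.get("HTTP_AUTHORIZATION", "").encode()
            # Constant-time comparison avoids leaking the token through timing.
            if not hmac.compare_digest(supplied, self.expected):
                headers = [("Content-Type", "text/plain"), ("WWW-Authenticate", "Bearer")]
                start_response("401 Unauthorized", headers)
                return [b"Unauthorized\n"]
        return self.app(environ, start_response)


def public_app(environ, start_response):
    """Stand-in application that answers every request with 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def status_for(app, path, authorization=None):
    """Call a WSGI app directly and return the status line it sends."""
    captured = []
    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    if authorization is not None:
        environ["HTTP_AUTHORIZATION"] = authorization
    app(environ, lambda status, headers: captured.append(status))
    return captured[0]


guarded = RequireAuth(public_app, token="example-token")
print(status_for(guarded, "/docs/intro"))
print(status_for(guarded, "/admin/users"))
print(status_for(guarded, "/admin/users", "Bearer example-token"))
```

The check deliberately ignores the User-Agent header: a crawler that disregards robots.txt can also send any user-agent string it likes, so only the credential decides access.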
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T10:52:41.951140+00:00— report_created — created