Report #1179
[architecture] I don't know which AI crawlers are hitting my site or how to control them
Serve a robots.txt at the site root with an explicit Allow or Disallow group for each known AI crawler, and log user-agent strings separately so you can detect new crawlers instead of guessing. Pair robots.txt with rate limits, because robots.txt is advisory: compliant crawlers honor it voluntarily, and nothing in it is enforced.
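As a minimal sketch of the per-crawler groups described above, the snippet below renders a robots.txt from a small policy table and checks the result with Python's standard urllib.robotparser. The POLICY table, the allow/block choices in it, and the helper names render_robots_txt and may_fetch are illustrative assumptions, not part of the report; only the user-agent tokens come from it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: the tokens are the ones named in this report; the
# allow/block choices are placeholders, not recommendations.
POLICY = {
    "GPTBot": "block",
    "CCBot": "block",
    "anthropic-ai": "block",
    "ChatGPT-User": "allow",
}


def render_robots_txt(policy, default="allow"):
    """Return robots.txt text with one explicit group per user-agent token."""
    rule = {"allow": "Allow: /", "block": "Disallow: /"}
    groups = [
        f"User-agent: {agent}\n{rule[action]}" for agent, action in policy.items()
    ]
    # Catch-all group for every agent that is not listed explicitly.
    groups.append(f"User-agent: *\n{rule[default]}")
    return "\n\n".join(groups) + "\n"


def may_fetch(robots_text, agent, path="/"):
    """Evaluate the rendered file the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)


ROBOTS_TXT = render_robots_txt(POLICY)
```

Calling may_fetch(ROBOTS_TXT, "GPTBot", "/docs") shows what a compliant crawler would conclude. It says nothing about crawlers that ignore the file, which is why the rate limits still matter.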
Journey Context:
AI crawlers identify themselves with user-agent strings, but the list is fragmented and growing. Which bots to block and which to allow is a policy decision, and robots.txt is the standard mechanism for expressing it. The trap is relying on a single generic Disallow or assuming that every AI bot respects robots.txt. The alternatives, token-gated access or aggressive rate limiting, do enforce the policy but are heavier to set up and run. For agents you want to serve, don't block; for ones you don't, be explicit. Log user agents so you can spot new ones, such as ChatGPT-User, GPTBot, CCBot, or anthropic-ai, and adjust the policy.
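The logging advice above could be sketched as a small access-log scan. This assumes the server writes the common "combined" log format, where the user agent is the last double-quoted field. The bot-hint pattern, the function names, and the SAMPLE lines are invented for illustration and use simplified user-agent strings.

```python
import re
from collections import Counter

# Tokens named in this report; extend the tuple as new crawlers show up.
KNOWN_AI_AGENTS = ("ChatGPT-User", "GPTBot", "CCBot", "anthropic-ai")
# Heuristic (an assumption): substrings that suggest an unlisted crawler.
BOT_HINT = re.compile(r"bot|crawl|spider", re.IGNORECASE)
# In combined log format the user agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def classify(user_agent):
    """Return the matching known token, 'unknown-bot', or 'other'."""
    lowered = user_agent.lower()
    for token in KNOWN_AI_AGENTS:
        if token.lower() in lowered:
            return token
    return "unknown-bot" if BOT_HINT.search(user_agent) else "other"


def summarize(log_lines):
    """Count hits per known AI crawler and collect unlisted bot-like agents."""
    known, unknown = Counter(), Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line does not end with a quoted user-agent field
        user_agent = match.group(1)
        label = classify(user_agent)
        if label == "unknown-bot":
            unknown[user_agent] += 1
        elif label != "other":
            known[label] += 1
    return known, unknown


# Invented sample lines (documentation IP range, simplified user agents).
SAMPLE = [
    '203.0.113.5 - - [13/Jun/2026:18:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [13/Jun/2026:18:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.9 - - [13/Jun/2026:18:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "CCBot/2.0"',
    '203.0.113.7 - - [13/Jun/2026:18:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleNewBot/0.1"',
    '203.0.113.8 - - [13/Jun/2026:18:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
```

Running summarize over a day's log gives per-crawler counts plus a list of unlisted bot-like agents to review. User-agent strings can be spoofed, so treat the counts as a signal for adjusting policy, not as proof of identity.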
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:56:11.242458+00:00— report_created — created