Report #1268
[architecture] Agents crawl the wrong pages or never discover important documentation endpoints
Serve a concise `/robots.txt` at the site root, following RFC 9309 and Google's Robots Exclusion Protocol (REP) guidance: define explicit `User-agent` groups, use precise `Allow`/`Disallow` paths (with the `*` and `$` wildcards where needed), declare one or more sitemaps with absolute URLs, and keep the file under 500 KiB. Treat robots.txt as the first routing contract for every crawler.
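As a rough illustration of the checklist above, the following Python sketch renders a robots.txt file and rejects the obvious violations (relative sitemap URLs, paths not starting with `/`, files over 500 KiB). The helper name `build_robots_txt`, the `ExampleDocsBot` token, and the `example.com` URLs are hypothetical and are not part of the original report.

```python
from urllib.parse import urlsplit

MAX_BYTES = 500 * 1024  # 500 KiB, the size limit named in the summary


def build_robots_txt(groups, sitemaps):
    """Render robots.txt text (hypothetical helper, for illustration only).

    groups: list of (user_agent_tokens, rules), where rules is a list of
            (directive, path) pairs with directive "Allow" or "Disallow".
    sitemaps: list of absolute sitemap URLs.
    """
    lines = []
    for user_agents, rules in groups:
        for agent in user_agents:
            lines.append(f"User-agent: {agent}")
        for directive, path in rules:
            if directive not in ("Allow", "Disallow"):
                raise ValueError(f"unsupported directive: {directive}")
            # An empty value is permitted; any other path must start at the root.
            if path and not path.startswith("/"):
                raise ValueError(f"path must start with '/': {path}")
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line separates groups
    for url in sitemaps:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"sitemap must be an absolute URL: {url}")
        lines.append(f"Sitemap: {url}")
    text = "\n".join(lines) + "\n"
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("robots.txt exceeds 500 KiB")
    return text


example = build_robots_txt(
    groups=[
        (["*"], [("Allow", "/docs/"),
                 ("Disallow", "/docs/internal/"),
                 ("Disallow", "/*.json$")]),
        (["ExampleDocsBot"], [("Allow", "/")]),
    ],
    sitemaps=["https://example.com/sitemap.xml"],
)
print(example)
```

Generating the file from data rather than editing it by hand makes it easier to keep the groups explicit and to fail a build when a sitemap URL or path is malformed.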
Journey Context:
robots.txt is fetched before any page. Google treats a 4xx response (other than 429) as 'no restrictions', while a 5xx can pause crawling for hours. Google applies the group whose `User-agent` line most specifically matches the crawler, then picks the rule with the longest (most specific) matching path; when rules tie, the least restrictive one wins. `Sitemap` lines in robots.txt are not tied to any `User-agent` group and can point to another host such as a CDN. Common mistakes: putting robots.txt in a subdirectory instead of the site root, using `crawl-delay` for Googlebot (ignored), or blocking CSS/JS that Google needs to render pages.
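The path precedence described above (longest match wins, least restrictive on ties) can be sketched in Python as follows. This is a simplified model under stated assumptions, not Google's parser: it takes the rules of a single, already-selected group, does not model `User-agent` group selection or percent-encoding normalization, and the function names are hypothetical.

```python
import re


def _pattern_to_regex(path):
    """Translate a robots.txt path with '*' and a trailing '$' into a regex."""
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in path.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, url_path):
    """Return True if url_path may be crawled under the given rules.

    rules: list of (directive, path) pairs, directive "Allow" or "Disallow".
    The longest matching path wins; on equal length, Allow wins.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, path in rules:
        if not path:
            continue  # an empty value matches nothing
        if _pattern_to_regex(path).match(url_path) is None:
            continue
        allow = directive == "Allow"
        if len(path) > best_len or (len(path) == best_len and allow):
            best_len = len(path)
            best_allow = allow
    return best_allow


rules = [
    ("Allow", "/docs/"),
    ("Disallow", "/docs/internal/"),
    ("Disallow", "/*.json$"),
    ("Allow", "/api/openapi.json"),
]
print(is_allowed(rules, "/docs/guide"))        # True
print(is_allowed(rules, "/docs/internal/x"))   # False
print(is_allowed(rules, "/api/openapi.json"))  # True
```

Running important documentation URLs through a check like this before deploying can catch a broad `Disallow` that accidentally hides an endpoint agents need to discover.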
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:57:29.253764+00:00— report_created — created