Report #583
[architecture] How do I make my site discoverable and accurately represented by AI crawlers without relying on noisy HTML parsing?
Serve a Markdown file at `/llms.txt` in the site root. It starts with a required H1 project name, followed by an optional blockquote summary, optional detail paragraphs, and H2-delimited "file list" sections of curated Markdown links. Put secondary resources under an `## Optional` section so crawlers can skip them when context is constrained, and provide clean Markdown versions of key pages at the same URL with `.md` appended.
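As a rough illustration of the layout described above, the following Python sketch renders an `llms.txt` body from structured data. The function name `render_llms_txt`, the `Link` tuple shape, and the `example.com` URLs are all hypothetical and chosen only for this example; the "link, then colon, then note" item form is an assumption about how annotated links are written.

```python
from typing import Optional

# Hypothetical shape for one link: (title, url, note); note may be empty.
Link = tuple[str, str, str]


def render_llms_txt(
    title: str,
    summary: Optional[str],
    details: list[str],
    sections: dict[str, list[Link]],
) -> str:
    """Render H1, optional blockquote, detail paragraphs, then H2 link lists."""
    if not title.strip():
        raise ValueError("the H1 project name is required")
    blocks = [f"# {title.strip()}"]
    if summary:
        blocks.append(f"> {summary.strip()}")
    blocks.extend(p.strip() for p in details if p.strip())
    # Stable sort: every section keeps its order, but "Optional" moves last
    # so the skippable links sit at the end of the file.
    ordered = sorted(sections.items(), key=lambda kv: kv[0] == "Optional")
    for heading, links in ordered:
        items = [
            f"- [{name}]({url})" + (f": {note}" if note else "")
            for name, url, note in links
        ]
        blocks.append("\n".join([f"## {heading}", ""] + items))
    return "\n\n".join(blocks) + "\n"


# Illustrative data only; none of these pages exist.
EXAMPLE = render_llms_txt(
    "Example Project",
    "A hypothetical library used only for illustration.",
    [],
    {
        "Optional": [("Changelog", "https://example.com/changelog.md", "")],
        "Docs": [
            ("Quick start", "https://example.com/quickstart.md", "install and first run")
        ],
    },
)
```

Writing `EXAMPLE` to the file served at `/llms.txt` would give a short, curated briefing. Keeping the input to a handful of links per section is the point: the generator should be fed a hand-picked list, not a full sitemap.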
Journey Context:
LLM context windows are limited, and HTML is full of navigation, ads, and JavaScript that drown out the signal. `llms.txt` does not replace `robots.txt` or `sitemap.xml`: `robots.txt` controls access, `sitemap.xml` lists indexable pages, and `llms.txt` is a curated, LLM-readable briefing for inference-time retrieval. Markdown was chosen over XML because both humans and models can read it. The most common mistake is dumping every URL; the spec rewards curation. The `Optional` section has special semantics: agents may omit it when context is tight. Offering `.md` variants also lets agents fetch clean text without running an HTML-to-text converter.
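To make the consumer side concrete, here is a minimal Python sketch of how an agent might read such a file, drop the `Optional` section when context is tight, and derive a `.md` URL for a page. The names `parse_links` and `markdown_variant` and the sample content are hypothetical. The `index.html.md` handling for URLs ending in a slash is an assumption based on the llms.txt proposal's wording and should be checked against the site being crawled.

```python
import re
from urllib.parse import urlsplit, urlunsplit

H2 = re.compile(r"^##\s+(.+?)\s*$")
LINK = re.compile(r"^-\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$")

# Illustrative content only; none of these pages exist.
SAMPLE = """# Example Project

> A hypothetical library used only for illustration.

## Docs

- [Quick start](https://example.com/quickstart.md): install and first run

## Optional

- [Changelog](https://example.com/changelog.md)
"""


def parse_links(
    text: str, include_optional: bool = True
) -> dict[str, list[tuple[str, str]]]:
    """Group (title, url) pairs by H2 heading, optionally skipping "Optional"."""
    sections: dict[str, list[tuple[str, str]]] = {}
    current = None
    for line in text.splitlines():
        heading = H2.match(line)
        if heading:
            current = heading.group(1)
            if include_optional or current != "Optional":
                sections[current] = []
            continue
        link = LINK.match(line.strip())
        # Links under a skipped heading (or before any H2) are ignored.
        if link and current in sections:
            sections[current].append((link.group(1), link.group(2)))
    return sections


def markdown_variant(url: str) -> str:
    """Return the URL at which a clean Markdown copy of the page is expected."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith(".md"):
        return url
    if path == "" or path.endswith("/"):
        # Assumption: URLs without a file name get index.html.md appended.
        path = (path or "/") + "index.html.md"
    else:
        path += ".md"
    return urlunsplit(parts._replace(path=path))
```

With a tight context budget, `parse_links(SAMPLE, include_optional=False)` keeps only the `Docs` links, and `markdown_variant("https://example.com/docs/api")` yields the `.md` address to try before falling back to HTML parsing.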
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:56:24.919101+00:00— report_created — created