
llms.txt vs robots.txt — what's the difference?

robots.txt tells crawlers which URLs they may fetch. llms.txt curates which URLs LLMs should ingest as canonical content. Both are plain-text files served from the site root, and both are read by crawlers, but they do completely different jobs: robots.txt is an access policy (first proposed in 1994), while llms.txt is content curation (a proposal published in 2024). Neither is enforced by the server; compliance is up to the crawler. Most sites need both.

Audit both files in one scan. Free first scan.

The two files, side by side

/robots.txt

Purpose: tell crawlers what they CAN and CAN'T fetch
Spec age: 1994 (Robots Exclusion Protocol; RFC 9309 in 2022)
Audience: all crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot, etc.
Honored by: all major crawlers (legal status varies; mostly de-facto compliance)
Format: User-agent + Disallow / Allow / Sitemap directives
Behavior: asks crawlers not to fetch matching paths; compliance is voluntary, so it is not a security control

User-agent: *
Disallow: /admin
Allow: /

Sitemap: https://acme.com/sitemap.xml
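To see how a compliant crawler reads rules like these, here is a minimal sketch using Python's standard-library urllib.robotparser. The acme.com host and the GPTBot user-agent string are placeholders for illustration. One caveat: Python's parser applies rules in file order, while RFC 9309 specifies longest-match precedence. For this rule set both approaches give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the example above; acme.com is a placeholder host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Allow: /

Sitemap: https://acme.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks this question before every fetch.
admin_allowed = parser.can_fetch("GPTBot", "https://acme.com/admin/users")
docs_allowed = parser.can_fetch("GPTBot", "https://acme.com/docs")

# site_maps() is available in Python 3.8 and later.
sitemaps = parser.site_maps()

print(admin_allowed, docs_allowed, sitemaps)
```

Nothing stops a non-compliant client from skipping this check, which is why robots.txt works as a policy statement rather than a lock.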

/llms.txt

Purpose: curate which pages LLMs should ingest as canonical content
Spec age: 2024 (llmstxt.org by Jeremy Howard)
Audience: AI / LLM crawlers specifically
Honored by: reportedly Perplexity and Anthropic, partially OpenAI; not endorsed by Google or Microsoft yet (adoption is informal and can change)
Format: markdown — H1, blockquote positioning, sections with curated URL lists
Behavior: recommends content; does NOT block access

# Acme Inc.

> Acme builds X for Y.

## Core docs

- [Docs](https://acme.com/docs): Getting started
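Because the file is ordinary markdown, a consumer can read it with a few lines of code. The sketch below is one possible reader, not a reference implementation: the LlmsTxt class, the parse_llms_txt function, and the regular expression are all hypothetical names chosen for illustration, and the parser handles only the H1, the first blockquote line, and link bullets under ## headings.

```python
import re
from dataclasses import dataclass, field

# Matches "- [Title](URL): description"; the description part is optional.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)

SAMPLE = """\
# Acme Inc.

> Acme builds X for Y.

## Core docs

- [Docs](https://acme.com/docs): Getting started
"""


@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    # section heading -> list of (link title, url, description)
    sections: dict[str, list[tuple[str, str, str]]] = field(default_factory=dict)


def parse_llms_txt(text: str) -> LlmsTxt:
    """Extract the H1, the first blockquote line, and links under each H2."""
    doc = LlmsTxt()
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            doc.sections[current] = []
        elif line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()
        elif line.startswith(">") and not doc.summary:
            doc.summary = line[1:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc.sections[current].append(
                    (match["title"], match["url"], match["desc"] or "")
                )
    return doc


doc = parse_llms_txt(SAMPLE)
print(doc.title, doc.summary, doc.sections)
```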

When you need each

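You need robots.txt: always. Every site on the web should have one. Without it, all crawlers default to “fetch everything”, so staging endpoints, unfinished pages, and other low-value URLs can end up crawled and surfaced in Google's index. Keep in mind that robots.txt is a public file and only a request: put truly private URLs such as admin panels behind authentication rather than relying on Disallow alone. Cost: 5 minutes once.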

You need llms.txt: if AI-engine citation is a measurable channel for you. SaaS, content sites, documentation, and B2B businesses where prospects research via ChatGPT/Perplexity should have one. E-commerce stores get less benefit, since their traffic is mostly transactional and AI engines cite them less. Cost: ~30 minutes to write the first version.

You need both, configured consistently: if llms.txt lists URLs that robots.txt disallows, a compliant crawler honors robots.txt and skips the llms.txt entry. Always cross-check both files. The Seoxpert llms.txt validator flags conflicts.
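A cross-check of this kind can be sketched in a few lines of Python. This is an illustration of the idea, not how any particular validator works: find_conflicts is a hypothetical helper, the sample files are made up, and the sketch assumes every URL in llms.txt is on the same host that serves the robots.txt.

```python
import re
from urllib.robotparser import RobotFileParser

# Pulls the URL out of every markdown link: [Title](https://...)
MD_LINK_RE = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")


def find_conflicts(robots_txt: str, llms_txt: str, user_agent: str = "*") -> list[str]:
    """Return llms.txt URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    urls = MD_LINK_RE.findall(llms_txt)
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Made-up sample files; acme.com is a placeholder host.
robots = "User-agent: *\nDisallow: /admin\nAllow: /\n"
llms = """\
# Acme Inc.

> Acme builds X for Y.

## Core docs

- [Docs](https://acme.com/docs): Getting started
- [Admin guide](https://acme.com/admin/guide): Internal only
"""

conflicts = find_conflicts(robots, llms)
print(conflicts)
```

Each URL in the result should either be removed from llms.txt or unblocked in robots.txt, depending on which file reflects your intent.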

FAQ

Does llms.txt replace robots.txt?

No. They do different things and you need both. robots.txt is a 1994-vintage access-policy file: it tells crawlers what they CAN and CAN'T fetch. llms.txt is a 2024-vintage curation file: it tells LLMs which pages are the most valuable canonical content on your site. robots.txt restricts; llms.txt recommends. A site can run with robots.txt alone (the current default) or with both (recommended for content sites that want AI citation). The one combination to avoid is llms.txt without robots.txt, because the access layer still matters.

Do AI engines actually read llms.txt?

It depends on the engine. As of mid-2026: Perplexity reportedly uses llms.txt to prioritize indexing, and Anthropic's Claude reportedly reads it for content discovery. OpenAI has not publicly committed to honoring it, but its crawlers do fetch it (you can see the requests in your server logs). Google and Microsoft have not endorsed it. So having one is a "small upside, no downside" decision: it costs ~30 minutes to write and may improve how the engines that respect it discover and cite your content.
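To check which clients actually request the file on your own site, you can scan your access logs. The sketch below assumes logs in the common "combined" format used by default in many Apache and nginx setups; the regular expression, the llms_txt_fetchers helper, and the sample lines (including the ExampleBot and OtherBot user agents) are all made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format tail: "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def llms_txt_fetchers(log_lines) -> Counter:
    """Count requests for /llms.txt per user-agent string."""
    counts = Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        # Ignore any query string when comparing the path.
        if match and match["path"].split("?", 1)[0] == "/llms.txt":
            counts[match["agent"]] += 1
    return counts


# Synthetic log lines, not real traffic.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [01/Jan/2025:00:00:02 +0000] "GET /docs HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    '198.51.100.4 - - [01/Jan/2025:00:00:03 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "OtherBot/2.0"',
    '203.0.113.7 - - [01/Jan/2025:00:00:04 +0000] "GET /llms.txt HTTP/1.1" 304 0 "-" "ExampleBot/1.0"',
]

fetchers = llms_txt_fetchers(SAMPLE_LOG)
print(fetchers.most_common())
```

A fetch in the logs shows only that the file was requested. It does not show how, or whether, the engine uses the contents.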

What format is llms.txt?

Plain markdown. The spec: a # H1 with the site name (the only required element), a > blockquote with the one-sentence positioning, an optional intro paragraph, then ## sections with bulleted URL lists. Each URL line: `- [Title](URL): one-sentence description`. Aim for canonical landing pages, docs, and the most important blog posts, not every URL on the site (that's what sitemap.xml is for). See llmstxt.org for the full spec.
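If your curated page list already lives in code or a CMS export, the layout described above is easy to generate. A minimal sketch, assuming a hypothetical render_llms_txt helper that takes a mapping of section headings to (title, URL, description) tuples:

```python
def render_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
    intro: str = "",
) -> str:
    """Render an llms.txt document: H1, blockquote, optional intro, H2 link lists."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    if intro:
        lines += [intro, ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        for link_title, url, description in links:
            lines.append(f"- [{link_title}]({url}): {description}")
        lines.append("")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


rendered = render_llms_txt(
    "Acme Inc.",
    "Acme builds X for Y.",
    {"Core docs": [("Docs", "https://acme.com/docs", "Getting started")]},
)
print(rendered)
```

Generating the file from one curated source keeps it short and deliberate, which is the point of the format.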

Can robots.txt and llms.txt conflict?

They can but shouldn't. If llms.txt lists a URL that robots.txt blocks, a compliant crawler honors robots.txt (the access layer wins) and ignores the llms.txt entry. Run the Seoxpert audit to catch this: it cross-references the two files and flags entries that point at robots.txt-disallowed URLs.

Where does each file live?

Both live at the root: /robots.txt and /llms.txt. Serve both as Content-Type: text/plain. Both apply only to the exact host they're served from, so each subdomain needs its own copy. Both can reference external URLs (robots.txt via Sitemap: lines, llms.txt via the URL list). Set cache headers that allow short revalidation, minutes to hours, since crawlers request both files often and the responses are cheap to serve.
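These serving rules can be checked against a response's headers. The sketch below is a hypothetical check_headers helper working on a plain dict of headers, so it runs without a network call. The 24-hour max-age ceiling is an assumption chosen for illustration; it lines up with RFC 9309's guidance that crawlers should not rely on a cached robots.txt for more than 24 hours.

```python
import re

MAX_AGE_RE = re.compile(r"max-age=(\d+)")


def check_headers(headers: dict[str, str], max_age_limit: int = 86400) -> list[str]:
    """Return warnings for a robots.txt or llms.txt response's headers."""
    # Header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    warnings = []

    content_type = normalized.get("content-type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        warnings.append(f"unexpected Content-Type: {content_type or 'missing'}")

    match = MAX_AGE_RE.search(normalized.get("cache-control", "").lower())
    if match is None:
        warnings.append("no Cache-Control max-age set")
    elif int(match.group(1)) > max_age_limit:
        warnings.append(f"max-age {match.group(1)}s exceeds {max_age_limit}s")
    return warnings


good = check_headers(
    {"Content-Type": "text/plain; charset=utf-8", "Cache-Control": "public, max-age=3600"}
)
bad = check_headers({"Content-Type": "text/html", "Cache-Control": "max-age=604800"})
print(good, bad)
```

A text/html Content-Type on either path often means the server is returning an error page or a single-page-app fallback instead of the file itself.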

Audit both files

One scan validates robots.txt + llms.txt + cross-references them. Free first scan.
