Seoxpert.io

What Is robots.txt?

Definition

robots.txt is a plain-text file at the root of a web host that tells crawlers which URL paths they are allowed — or not allowed — to fetch. It is the first file most well-behaved bots request when visiting a site.

The file implements the Robots Exclusion Protocol (REP), formalised by the IETF in RFC 9309. It is advisory: compliant crawlers obey it, but malicious bots will ignore it.

Location

The file must be served from the exact path /robots.txt at the root of each host you want to govern.

https://example.com/robots.txt            ← governs example.com
https://www.example.com/robots.txt        ← separate file for www subdomain
https://shop.example.com/robots.txt       ← separate file for shop subdomain

Each subdomain, each protocol (http vs https), and each port is treated as its own host. A single robots.txt cannot govern multiple subdomains.
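Because every scheme-and-host pair gets its own file, the robots.txt location for any page URL can be derived mechanically. A minimal sketch using Python's standard library (the helper name is ours):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, fragment
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://shop.example.com/cart/item?id=5"))
# https://shop.example.com/robots.txt
```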

Syntax

The file is made up of one or more records. Each record names one or more user agents and a list of Disallow or Allow rules.

User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /admin/public/

User-agent: Googlebot
Disallow: /internal-search

Sitemap: https://example.com/sitemap.xml

User-agent: * applies to every crawler that does not have a more specific record. Paths support * wildcards and $ end-anchors for Google and most major crawlers.
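Rule evaluation like the above can be exercised with Python's standard-library parser. One caveat: `urllib.robotparser` applies rules in file order (first match wins), whereas Google uses the longest matching rule, so the `Allow` line is placed before the broader `Disallow` in this sketch; the user agent names are illustrative.

```python
from urllib import robotparser

rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Disallow: /cart

User-agent: Googlebot
Disallow: /internal-search
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("mybot", "https://example.com/admin/settings"))       # False
print(rp.can_fetch("mybot", "https://example.com/admin/public/a.html"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/internal-search"))  # False
# The Googlebot record replaces the generic record; records do not merge:
print(rp.can_fetch("Googlebot", "https://example.com/cart"))             # True
```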

What robots.txt Does NOT Do

  • It does not prevent indexing. A URL blocked in robots.txt can still appear in search results when other sites link to it — just without a snippet, because Google could not fetch the content. Use a noindex meta tag to keep a page out of the index. Critically, noindex only works if the page is crawlable, so do not block it in robots.txt.
  • It does not protect private data. robots.txt is a public file. Listing sensitive paths there reveals their existence to anyone who fetches it. Protect private resources with authentication, not robots rules.
  • It does not affect already-indexed URLs immediately. Blocking a previously indexed URL prevents re-crawling, so the indexed version may persist for weeks before Google drops it.

Common Mistakes

Blocking CSS or JavaScript

If Google cannot fetch your stylesheets or scripts, it cannot render the page correctly, which affects how it understands layout, mobile-friendliness, and Core Web Vitals. Never block /css/, /js/, or /_next/static/.
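One way to catch this regression early is a check that asserts common asset prefixes remain fetchable. A sketch with the stdlib parser, using an intentionally misconfigured file and illustrative paths:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /css/"])  # a misconfigured file

# Asset paths that rendering depends on (illustrative list)
for path in ("/css/site.css", "/js/app.js", "/_next/static/chunks/main.js"):
    blocked = not rp.can_fetch("Googlebot", f"https://example.com{path}")
    print(path, "BLOCKED" if blocked else "ok")
```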

Blocking the entire site during launch

A staging Disallow: / accidentally pushed to production is one of the most common launch-day SEO disasters. Run a scan immediately after any deploy that touches infrastructure.

Using robots.txt to hide soft-404 or duplicate pages

Blocking duplicates in robots.txt prevents Google from seeing the canonical tag that would consolidate them. For duplicate or low-value pages, use canonical tags or noindex instead.

How Seoxpert Checks robots.txt

Every scan begins by fetching robots.txt and reporting its status, syntax errors, and effective rules. The crawler respects the file during its own crawl, so any issue that blocks Googlebot also blocks the scan. Common findings:

  • Missing file (404) — no crawl rules, full access assumed
  • Blocking all crawlers — likely a staging rule pushed to production
  • Blocking CSS or JS paths that Google needs to render the page
  • Sitemap directive pointing to a 404 or non-matching host
  • Syntax errors that cause directives to be silently ignored

See the robots.txt tester to check a specific path against a live file, or run a full technical SEO audit.

Frequently Asked Questions

Does robots.txt remove pages from Google?

No. robots.txt only blocks crawling. A URL blocked there can still appear in search results if other sites link to it — just without a description. Use a noindex meta tag to prevent indexing.

Where should robots.txt live?

At the exact path /robots.txt on the root of every host. Subdomains need their own file — www.example.com/robots.txt does not govern shop.example.com.

Is robots.txt case-sensitive?

Path values are. /Admin/ and /admin/ are different paths. Directive names (User-agent, Disallow, Allow) are case-insensitive, but the path values they contain are compared case-sensitively against the URL.
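This is easy to verify with Python's stdlib parser (the user agent name is illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /Admin/"])

print(rp.can_fetch("mybot", "https://example.com/Admin/users"))  # False: blocked
print(rp.can_fetch("mybot", "https://example.com/admin/users"))  # True: different path
```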

What happens if robots.txt returns 5xx?

Google treats a 5xx response as a temporary block and may stop crawling until the file resolves. A 404 is treated as "no rules exist" and allows full crawling.
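The status-code handling described above can be sketched as a simple classifier. This is a simplification: real crawlers also cache the file and retry over time.

```python
def classify_robots_status(status: int) -> str:
    """Interpret the HTTP status of a robots.txt fetch (simplified)."""
    if 200 <= status < 300:
        return "parse the file and obey its rules"
    if status == 404:
        return "no rules: crawl everything"
    if 500 <= status < 600:
        return "temporary block: pause crawling until it resolves"
    return "handle case by case"

print(classify_robots_status(404))  # no rules: crawl everything
print(classify_robots_status(503))  # temporary block: pause crawling until it resolves
```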

Can I use noindex in robots.txt?

No. Google deprecated the unofficial "Noindex:" directive in robots.txt in 2019. Use a meta tag or X-Robots-Tag HTTP header on the page itself.

Audit your robots.txt and every other crawl directive with a free scan.