Definition
robots.txt is a plain-text file at the root of a web host that tells crawlers which URL paths they are allowed — or not allowed — to fetch. It is the first file most well-behaved bots request when visiting a site.
The file implements the Robots Exclusion Protocol (REP), formalised by the IETF in RFC 9309. It is advisory: compliant crawlers obey it, but malicious bots will ignore it.
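To make the "compliant crawler" behaviour concrete, here is a minimal Python sketch of how a well-behaved bot consults the file before fetching a page. It uses the standard-library urllib.robotparser; the bot name and target URL are hypothetical placeholders.

from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity and target URL; substitute your own.
USER_AGENT = "ExampleBot"
TARGET = "https://example.com/admin/settings"

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the live file

if parser.can_fetch(USER_AGENT, TARGET):
    print("allowed - fetch the page")
else:
    print("disallowed - a well-behaved bot skips this URL")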
Location
The file must be served from the exact path /robots.txt at the root of each host you want to govern.
https://example.com/robots.txt        ← governs example.com
https://www.example.com/robots.txt    ← separate file for the www subdomain
https://shop.example.com/robots.txt   ← separate file for the shop subdomain
Each subdomain and each protocol (http vs https) is treated as its own host. A single robots.txt cannot govern multiple subdomains.
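Because scope is defined by scheme plus host, the governing file for any page can be derived mechanically. A minimal Python sketch, assuming you start from an arbitrary page URL:

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page."""
    parts = urlsplit(page_url)
    # Only the scheme and host matter; the path and query are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://shop.example.com/cart?item=42"))
# -> https://shop.example.com/robots.txt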
Syntax
The file is made up of one or more records. Each record names one or more user agents and a list of Disallow or Allow rules.
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /admin/public/

User-agent: Googlebot
Disallow: /internal-search

Sitemap: https://example.com/sitemap.xml
User-agent: * applies to every crawler that does not have a more specific record. Paths support * wildcards and $ end-anchors for Google and most major crawlers.
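The matching semantics of * and $ can be illustrated by translating a pattern into a regular expression: * matches any run of characters, and $ anchors the pattern to the end of the URL. The Python sketch below is a rough approximation of Google-style matching, not a full RFC 9309 implementation, and the pattern shown is a hypothetical one.

import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a robots.txt path pattern using * and $ into a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = pattern_to_regex("/internal-search*.html$")
print(bool(rule.match("/internal-search/results.html")))       # True
print(bool(rule.match("/internal-search/results.html?q=x")))   # False: $ anchors the end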
What robots.txt Does NOT Do
- It does not prevent indexing. A URL blocked in robots.txt can still appear in search results when other sites link to it, just without a snippet, because Google could not fetch the content. Use a noindex meta tag to keep a page out of the index. Critically, noindex only works if the page is crawlable, so do not block it in robots.txt (see the sketch after this list).
- It does not protect private data. robots.txt is a public file. Listing sensitive paths there reveals their existence to anyone who fetches it. Protect private resources with authentication, not robots rules.
- It does not affect already-indexed URLs immediately. Blocking a previously indexed URL prevents re-crawling, so the indexed version may persist for weeks before Google drops it.
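The interaction between blocking and noindex is easy to check programmatically. The sketch below uses only the Python standard library and a hypothetical URL; it flags the contradictory state where a page carries a noindex signal that crawlers are never allowed to see.

from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

PAGE = "https://example.com/old-promo"  # hypothetical page you want out of the index

parts = urlsplit(PAGE)
rp = RobotFileParser()
rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
rp.read()
blocked = not rp.can_fetch("Googlebot", PAGE)

with urlopen(PAGE) as resp:
    header_noindex = "noindex" in (resp.headers.get("X-Robots-Tag") or "").lower()
    body_noindex = b"noindex" in resp.read().lower()  # rough stand-in for parsing the meta robots tag

if blocked and (header_noindex or body_noindex):
    print("Conflict: the page is noindexed, but robots.txt keeps crawlers from ever seeing that signal.")
elif blocked:
    print("Blocked in robots.txt: the URL may still be indexed from links alone.")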
Common Mistakes
Blocking CSS or JavaScript
If Google cannot fetch your stylesheets or scripts, it cannot render the page correctly, which affects how it understands layout, mobile-friendliness, and Core Web Vitals. Never block /css/, /js/, or /_next/static/.
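One way to catch this class of mistake is to test a few representative asset URLs against the live file. The paths below are illustrative placeholders; substitute URLs your templates actually reference.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Representative asset URLs; replace with real ones from your pages.
assets = [
    "https://example.com/css/site.css",
    "https://example.com/js/app.js",
    "https://example.com/_next/static/chunks/main.js",
]

for url in assets:
    if not rp.can_fetch("Googlebot", url):
        print(f"WARNING: {url} is blocked - Google cannot fully render pages that need it")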
Blocking the entire site during launch
A staging Disallow: / accidentally pushed to production is one of the most common launch-day SEO disasters. Run a scan immediately after any deploy that touches infrastructure.
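A lightweight guard is a post-deploy check that fails loudly when the production file blocks the site root. The sketch below assumes a production hostname of example.com and is meant to run in a deploy pipeline or cron job.

import sys
from urllib.robotparser import RobotFileParser

ROBOTS = "https://example.com/robots.txt"  # assumed production host

rp = RobotFileParser()
rp.set_url(ROBOTS)
rp.read()

# A staging "Disallow: /" that reached production makes the homepage unfetchable.
if not rp.can_fetch("Googlebot", "https://example.com/"):
    sys.exit("robots.txt blocks the site root - investigate before rankings drop")
print("robots.txt allows crawling of the site root")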
Using robots.txt to hide soft-404 or duplicate pages
Blocking duplicates in robots.txt prevents Google from seeing the canonical tag that would consolidate them. For duplicate or low-value pages, use canonical tags or noindex instead.
How Seoxpert Checks robots.txt
Every scan begins by fetching robots.txt and reporting its status, syntax errors, and effective rules. The crawler respects the file during its own crawl, so any issue that blocks Googlebot also blocks the scan. Common findings:
- Missing file (404): no crawl rules, full access assumed
- Blocking all crawlers: likely a staging rule pushed to production
- Blocking CSS or JS paths that Google needs to render the page
- Sitemap directive pointing to a 404 or a non-matching host (a rough check is sketched below)
- Syntax errors that cause directives to be silently ignored
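The Sitemap finding can be approximated outside the product as well. The Python sketch below is not Seoxpert's implementation, just a rough stand-in that verifies each Sitemap directive resolves and points at the same host as the robots.txt file (site_maps() requires Python 3.8+).

from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from urllib.robotparser import RobotFileParser

ROBOTS = "https://example.com/robots.txt"  # assumed host under audit

rp = RobotFileParser()
rp.set_url(ROBOTS)
rp.read()

robots_host = urlsplit(ROBOTS).netloc
for sitemap in rp.site_maps() or []:
    if urlsplit(sitemap).netloc != robots_host:
        print(f"{sitemap}: host does not match {robots_host}")
    try:
        with urlopen(sitemap) as resp:
            print(f"{sitemap}: HTTP {resp.status}")
    except (HTTPError, URLError) as exc:
        print(f"{sitemap}: unreachable ({exc})")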
See the robots.txt tester to check a specific path against a live file, or run a full technical SEO audit.