
Sitemap URLs Blocked by robots.txt

This issue occurs when URLs included in your XML sitemap are blocked by your site's robots.txt file, creating conflicting instructions for search engines.

By Seoxpert Editorial

Why it matters

Sitemaps are intended to help search engines discover and prioritize important, indexable pages. If those same URLs are blocked in robots.txt, crawlers are told not to access them, which can prevent them from being indexed and waste crawl budget. This contradiction can result in poor visibility for key pages and inefficient crawling of your site.

Impact

Search engines may not crawl or index important pages, leading to reduced organic visibility. Crawl budget is wasted on URLs that cannot be accessed, and search engines may lose trust in your sitemap's accuracy, potentially ignoring it altogether.

How it's detected

This issue is typically detected by running a site audit with SEO tools (e.g., Google Search Console, Screaming Frog, Sitebulb) that compare your sitemap URLs against your robots.txt rules. Search Console will often flag 'Submitted URL blocked by robots.txt' errors under Coverage reports.
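You can also perform this cross-check yourself with a short script. The sketch below (a minimal example, not any specific tool's implementation) uses Python's standard-library `urllib.robotparser` and `xml.etree.ElementTree` to list sitemap URLs that robots.txt disallows; the robots.txt and sitemap contents are inlined as strings here, but in practice you would fetch them from your site.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def blocked_sitemap_urls(robots_txt: str, sitemap_xml: str) -> list[str]:
    """Return sitemap <loc> URLs that robots.txt disallows for all user agents."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
    return [url for url in locs if not parser.can_fetch("*", url)]

robots = """\
User-agent: *
Disallow: /products/
"""

sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/widget-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""

for url in blocked_sitemap_urls(robots, sitemap):
    print("Blocked:", url)  # flags /products/widget-1, but not /about
```

Any URL this prints is sending the conflicting signal described above and should either be removed from the sitemap or unblocked in robots.txt.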

Common causes

  • robots.txt written to block staging paths that were repurposed
  • Global Disallow: / accidentally blocking sitemap URLs
  • Copy-pasting robots.txt rules from another environment without review
  • Bulk disallow rules that unintentionally match important URLs
  • Automated scripts updating sitemaps without syncing with robots.txt changes

How to fix it

Review your robots.txt file and sitemap.xml. Ensure that all URLs listed in your sitemap are allowed by robots.txt. Either remove blocked URLs from your sitemap, or update robots.txt to allow access to those URLs. Only include URLs in your sitemap that you want search engines to crawl and index.

Code examples

Problematic robots.txt and sitemap.xml

# robots.txt
User-agent: *
Disallow: /products/

<!-- sitemap.xml -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget-1</loc>
  </url>
  <url>
    <loc>https://example.com/products/widget-2</loc>
  </url>
</urlset>
<!-- The /products/ URLs are in the sitemap but blocked by robots.txt. -->

Fixed robots.txt (allowing sitemap URLs)

# robots.txt
User-agent: *
Disallow: /private/
# /products/ is no longer blocked, so sitemap URLs are crawlable.

Fixed sitemap.xml (removing blocked URLs)

# robots.txt
User-agent: *
Disallow: /products/

<!-- sitemap.xml -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- /products/ URLs removed because they are blocked -->
</urlset>

FAQ

Why is it a problem if sitemap URLs are blocked by robots.txt?

It sends conflicting signals to search engines: the sitemap says 'please crawl and index this URL,' but robots.txt says 'do not crawl.' This can prevent important pages from being indexed and wastes crawl budget.

How do I check if my sitemap URLs are blocked by robots.txt?

Use tools like Google Search Console (Coverage report), Screaming Frog, or Sitebulb to cross-reference sitemap URLs with your robots.txt rules. Google Search Console will specifically flag 'Submitted URL blocked by robots.txt' errors.

Should I update robots.txt or the sitemap to fix this issue?

Either can be correct; it depends on your intent. If the URLs should be indexed, update robots.txt to allow them. If they should not be indexed, remove them from the sitemap. Only include indexable URLs in your sitemap.

Can I use 'noindex' instead of robots.txt to block sitemap URLs?

'noindex' can prevent indexing, but if a URL is blocked by robots.txt, search engines can't see the 'noindex' directive. For sitemap URLs, it's best to allow crawling and use 'noindex' if you don't want them indexed, or remove them from the sitemap if they shouldn't be indexed at all.
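As an illustration of that approach, a page you want crawlable but excluded from the index would carry a robots meta tag like the following (the page path is a hypothetical example), while remaining unblocked in robots.txt so crawlers can actually see the directive:

```html
<!-- e.g. on https://example.com/products/internal-widget -->
<head>
  <meta name="robots" content="noindex">
</head>
```

The equivalent for non-HTML resources is the `X-Robots-Tag: noindex` HTTP response header. Either way, the URL should still be left out of the sitemap, since sitemaps should only list URLs you want indexed.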

What happens if I leave blocked URLs in my sitemap?

Search engines may ignore those URLs, flag errors in Search Console, and potentially distrust your sitemap, reducing its effectiveness for other URLs.

Found this issue on your site?

Run a scan to see if Sitemap URLs Blocked by robots.txt affects your pages.

Scan my website →