Seoxpert.io

What Is sitemap.xml?

sitemap.xml is an XML file that lists the URLs on a website so search engines can discover them. It implements the sitemaps.org protocol, which Google introduced in 2005 and which Google, Yahoo, and Microsoft jointly adopted in 2006. Google enforces two hard limits per file: 50,000 URLs and 50 MB uncompressed.

Format

A minimal sitemap.xml:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2026-04-15</lastmod>
  </url>
</urlset>

Inside each <url> entry, only <loc> is required. <lastmod> is widely used; Google's documentation says it ignores <changefreq> and <priority>.
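The format above can also be generated programmatically. The following is a minimal sketch using Python's standard library; the function name build_sitemap and the (loc, lastmod) input shape are assumptions made for illustration, not part of the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML (bytes) from (loc, lastmod-or-None) pairs.

    Hypothetical helper: only <loc> is always written; <lastmod> is
    written when a value is supplied.
    """
    # Emit the sitemap namespace as the default xmlns, without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # ElementTree escapes characters such as & in text automatically.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://example.com/", "2026-05-20"),
    ("https://example.com/about", "2026-04-15"),
])
```

Building the tree with a library rather than string concatenation means URLs containing & or other reserved characters are escaped correctly.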

Sitemap index files

For sites larger than 50,000 URLs, split into multiple sitemap files referenced from a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-05-20</lastmod>
  </sitemap>
</sitemapindex>
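Splitting a large URL list and building the matching index can be sketched as follows. The child file names (sitemap-1.xml, sitemap-2.xml, ...) and the helper names are hypothetical; any naming scheme works as long as the index points at the real child URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit described above


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(child_locs):
    """Build a sitemap index (bytes) that references each child sitemap URL."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in child_locs:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# Illustrative data: 120,000 URLs need three child sitemaps.
urls = [f"https://example.com/p/{i}" for i in range(120_000)]
chunks = chunk_urls(urls)
child_locs = [
    f"https://example.com/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)
]
index_xml = build_index(child_locs)
```

Each chunk would then be written out as its own sitemap file at the corresponding child URL.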

Common mistakes

  • Sitemap returns HTML instead of XML. Vercel / Next.js / SPA catch-all routes serve index.html for unmatched paths. Symptom: the URL responds successfully, but the root element of the body is <html>, not <urlset>.
  • Future-dated lastmod. Almost always a server-time bug. Google may stop trusting lastmod across the entire sitemap.
  • Listing noindex / robots-blocked URLs. Sends conflicting signals — Google has to reconcile “you told me about this URL but also told me not to index it.”
  • Listing 404 URLs. Dead links train Google to trust the sitemap less.
  • Sitemap-index pointing at dead children. After a refactor that renamed a route, the index file still references the old child sitemap path.
  • Exceeding 50,000 URLs in one file. Google ignores the entire file. Split into a sitemap index.
  • Not referencing from robots.txt. Add Sitemap: https://example.com/sitemap.xml so crawlers discover it automatically.
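Several of the mistakes above can be caught with a small offline check. The sketch below is one possible approach, not a complete validator: check_sitemap is a hypothetical helper, it compares dates only (ignoring time zones), and it assumes lastmod values carry at least day precision.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000


def check_sitemap(body, today):
    """Return a list of problem descriptions for a sitemap body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    # An SPA catch-all typically yields <html> here instead of <urlset>.
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        return [f"unexpected root element: {root.tag}"]

    problems = []
    urls = root.findall(f"{{{SITEMAP_NS}}}url")
    if len(urls) > MAX_URLS:
        problems.append(f"{len(urls)} URLs exceeds the {MAX_URLS} limit")
    for url in urls:
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        lastmod = url.findtext(f"{{{SITEMAP_NS}}}lastmod")
        if not lastmod:
            continue
        # Assumption: the first 10 characters are YYYY-MM-DD.
        try:
            day = date.fromisoformat(lastmod.strip()[:10])
        except ValueError:
            problems.append(f"unparseable lastmod for {loc}: {lastmod}")
            continue
        if day > today:
            problems.append(f"future-dated lastmod for {loc}: {lastmod}")
    return problems
```

Checks for 404s, noindex pages, and robots-blocked URLs need HTTP requests against each listed URL and are left out of this sketch.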

sitemap.xml vs llms.txt

Both are URL-listing files for crawlers. sitemap.xml is exhaustive: a machine-readable list of every canonical, indexable URL, aimed at traditional search engines. llms.txt is curated: a selective list of key pages for LLM ingestion. Most sites need both.

Related terms

  • robots.txt — reference the sitemap via Sitemap:.
  • llms.txt — curated URL list for AI crawlers.
  • Canonical tag — sitemap URLs should be the canonical version.
  • noindex — don't list noindex URLs in the sitemap.

Validate your sitemap.xml

Use the free sitemap checker: it fetches the file, validates the XML structure, and flags future-dated lastmod values, dead URLs, and files that exceed Google's size limits.

Frequently asked questions

Where should sitemap.xml live?

Conventionally at /sitemap.xml on the site root, but the actual location is flexible — what matters is that the URL is referenced from robots.txt via a Sitemap: line, or submitted to Google Search Console / Bing Webmaster Tools directly. URLs inside the sitemap must be on the same host (or a host you can verify ownership of).
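To confirm which sitemap URLs a robots.txt file actually declares, Python's standard urllib.robotparser can read the Sitemap: lines (Python 3.8 or later). The wrapper name sitemaps_from_robots is an assumption for illustration; the sketch parses text you already have rather than fetching anything.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared via Sitemap: lines in robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


robots = (
    "User-agent: *\n"
    "Disallow: /admin\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)
declared = sitemaps_from_robots(robots)
```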

What are sitemap.xml's size limits?

Google documents two hard limits per sitemap file: 50,000 URLs and 50 MB uncompressed. Going over either causes Google to ignore the entire file. For sites larger than 50,000 URLs, split into multiple sitemap files and reference them from a sitemap index file (commonly named sitemap_index.xml).

Does sitemap.xml guarantee indexing?

No. A sitemap helps Google discover URLs, but it doesn't guarantee they'll be crawled or indexed. Google still applies its quality signals to decide which discovered URLs to index. A URL not in the sitemap can still be indexed if Googlebot finds it via internal or external links.

What is lastmod and does Google use it?

lastmod is an optional XML element indicating when the URL's content was last meaningfully changed. Google uses it as a crawl-priority signal but only trusts it if it reflects real content changes — sites that update lastmod on every regeneration (regardless of content change) train Google to ignore the field.
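One way to keep lastmod honest is to bump it only when a page's content hash changes. This is a sketch under stated assumptions: update_lastmod and the in-memory store are hypothetical, and hashing the full content treats any byte-level difference as a change, so a real generator would likely hash only the meaningful part of the page.

```python
import hashlib


def update_lastmod(store, url, content, today):
    """Return the lastmod for `url`, bumping it only when `content` changed.

    `store` maps url -> (content_hash, lastmod) and is updated in place.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = store.get(url)
    if previous and previous[0] == digest:
        # Unchanged content keeps its previous lastmod.
        return previous[1]
    store[url] = (digest, today)
    return today


store = {}
first = update_lastmod(store, "https://example.com/about", "v1", "2026-04-15")
# Regenerating the sitemap later with identical content keeps the old date.
second = update_lastmod(store, "https://example.com/about", "v1", "2026-05-20")
# A real content change moves lastmod forward.
third = update_lastmod(store, "https://example.com/about", "v2", "2026-05-20")
```

In practice the store would be persisted between builds, for example as a JSON file alongside the generated sitemap.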

Why does my sitemap.xml return HTML?

Common Next.js / Vercel / SPA misconfiguration: the catch-all route serves index.html for unmatched paths, including /sitemap.xml. Google sees an HTML document at a URL that should be XML and ignores it. Fix: ensure /sitemap.xml has a dedicated route handler returning XML with Content-Type: application/xml, or generate it as a static file at build time.
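The routing fix can be illustrated in a framework-neutral way. The sketch below is a plain Python WSGI app, not Next.js code, and the names app, SITEMAP_XML, and INDEX_HTML are assumptions; the point is only that the /sitemap.xml route is matched before the catch-all and returns XML with an XML Content-Type.

```python
SITEMAP_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b"  <url><loc>https://example.com/</loc></url>\n"
    b"</urlset>\n"
)
INDEX_HTML = b"<!doctype html><html><body>app shell</body></html>"


def app(environ, start_response):
    """WSGI app in which /sitemap.xml is handled before the SPA catch-all."""
    if environ.get("PATH_INFO") == "/sitemap.xml":
        # Dedicated route: XML body with an XML Content-Type.
        body, content_type = SITEMAP_XML, "application/xml; charset=utf-8"
    else:
        # Catch-all: every other path receives the app shell.
        body, content_type = INDEX_HTML, "text/html; charset=utf-8"
    start_response("200 OK", [
        ("Content-Type", content_type),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, the app can be served with the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever().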
