Sitemap Strategy in 2026: What to Include, Exclude, and Why
Sitemaps are a crawl hint and an indexing map. A practical guide to building and maintaining sitemaps for content-heavy sites.
TL;DR
- A sitemap is a discovery + crawl hint, not a ranking hack.
- Include canonical, indexable URLs you actually want to show in search.
- Keep one sitemap under 50,000 URLs and 50MB (uncompressed); use a sitemap index for more.
- `lastmod` only helps if it’s consistently accurate; `priority`/`changefreq` are effectively ignored by modern engines.
- Treat sitemaps as infrastructure: generate them automatically, validate them, and monitor processing errors.
What a sitemap is (and what it isn’t)
A sitemap is a machine-readable list of URLs that helps search engines discover and re-discover content. In 2026, it’s best to think of it as:
- A crawl map: “Here are the pages I care about.”
- A change signal (sometimes): “These pages changed recently.” (Only true when `lastmod` is reliable.)
- A coverage diagnostic: a stable place to compare “what I want indexed” vs. “what got indexed”.
What it is not:
- Not a guarantee that every URL will be crawled.
- Not a guarantee that every URL will be indexed.
- Not a substitute for internal linking, canonicalization, or quality content.
If your site has more than a few dozen URLs, a sitemap is table-stakes. If your site has dynamic or faceted navigation, a sitemap becomes the guardrail that keeps crawlers focused on the right URLs.
Sitemap size limits (and why they matter)
Two constraints shape almost every sitemap strategy:
- Maximum 50,000 URLs per sitemap
- Maximum 50MB (uncompressed) per sitemap
If you exceed either limit, you split into multiple sitemaps and optionally add a sitemap index that lists them.
Why this matters:
- Large sites “accidentally” produce huge URL spaces (filters, sorts, pagination, campaigns).
- Crawlers have budgets. If you dump noise into your sitemap, you burn crawl budget on pages you don’t want indexed.
- Splitting by content type (blog, labs, projects) makes debugging easier because you can identify which slice is failing.
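The split-and-index step can be sketched in a few lines. This is a minimal illustration, not a standard API: the function names and the `sitemap-N.xml` naming scheme are assumptions you would adapt to your own build.

```typescript
// Split a URL list across the 50,000-URL protocol limit and emit a sitemap index.
const MAX_URLS_PER_SITEMAP = 50_000;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

function buildSitemapIndex(origin: string, urls: string[]): string {
  const files = chunk(urls, MAX_URLS_PER_SITEMAP);
  const entries = files
    .map((_, i) => `  <sitemap><loc>${origin}/sitemap-${i}.xml</loc></sitemap>`)
    .join("\n");
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">`,
    entries,
    `</sitemapindex>`,
  ].join("\n");
}
```

Splitting by content type instead of by raw count works the same way: one `chunk` call per content slice, one index listing all the resulting files.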
What to include vs exclude (the decision table)
Here’s the simplest “include/exclude” table that prevents most sitemap mistakes:
| URL type | Include in sitemap? | Why |
|---|---|---|
| Canonical content pages | Yes | These are the pages you want to rank and maintain. |
| Pages with `noindex` | No | Mixed signals waste crawl budget and create processing noise. |
| Duplicate URLs (same content) | No | Canonicals should win; duplicates dilute crawl focus. |
| Internal search results | No | Often low-quality, infinite space, and can look spammy. |
| Faceted / filtered category pages | Usually no | Only include if curated and canonical (e.g., “/blog/tag/agents” as a real hub). |
| Paginated listing pages | Sometimes | Include only if they’re canonical and valuable (e.g., archive pages), otherwise let internal links handle discovery. |
| Login / account / private pages | No | Not indexable content. |
| Redirecting URLs | No | Engines want final destinations, not hops. |
| Parameterized tracking URLs (`?utm=` etc.) | No | Canonicalize and exclude to prevent duplication. |
Two rules that rarely fail:
- Don’t include anything you wouldn’t want as a search result.
- Only include URLs that resolve to a stable, canonical 200 page.
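The decision table above collapses into a single predicate. A sketch, assuming you already have crawl data for each page (the `PageInfo` fields and the `/search` path are illustrative, not a real API):

```typescript
// Inputs assumed to come from your own crawl or build data.
interface PageInfo {
  url: string;       // final URL after following any redirects
  status: number;    // HTTP status of the final response
  noindex: boolean;  // robots meta tag or X-Robots-Tag header
  canonical: string; // canonical URL declared by the page
}

function belongsInSitemap(page: PageInfo): boolean {
  const u = new URL(page.url);
  if (page.status !== 200) return false;              // no redirects or errors
  if (page.noindex) return false;                     // mixed signals waste crawl budget
  if (page.canonical !== page.url) return false;      // canonical URLs only
  if (u.search !== "") return false;                  // no ?utm= or other parameterized URLs
  if (u.pathname.startsWith("/search")) return false; // no internal search results
  return true;
}
```

Running every candidate URL through a predicate like this at build time is what turns the table from advice into a guardrail.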
Canonicalization: the sitemap “truth layer”
Search engines typically show canonical URLs in results. Your sitemap should reinforce that reality.
Canonical URLs only
If a page is accessible through multiple URLs (common with query params, alternative paths, or trailing-slash differences), choose one and make it canonical:
- Canonical tags on the page
- Internal links pointing to the canonical
- Sitemap including only the canonical
If you list both the canonical and the duplicate in the sitemap, you’re telling crawlers: “I’m not sure which one matters.” That uncertainty shows up as slower indexing and noisier coverage reports.
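One way to enforce “canonical URLs only” is to normalize every URL through one function before it reaches the sitemap. A sketch, assuming a specific set of conventions (lowercase host, tracking parameters stripped, no trailing slash except the root); the parameter list is illustrative:

```typescript
// Normalize a URL to one canonical form before it enters the sitemap.
function canonicalize(raw: string): string {
  const u = new URL(raw);
  u.hash = "";
  // Strip common tracking parameters (illustrative list).
  for (const key of Array.from(u.searchParams.keys())) {
    if (key.startsWith("utm_") || key === "gclid" || key === "fbclid") {
      u.searchParams.delete(key);
    }
  }
  // Trailing-slash policy: none, except for the root path.
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}
```

The exact policy matters less than applying the same one everywhere: canonical tags, internal links, and the sitemap should all emit the output of the same normalizer.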
Multi-variant pages (mobile/desktop, language versions)
If you have multiple variants, your sitemap strategy should reflect your canonical plan:
- If variants are truly separate localized pages, use proper localization annotations (and consider separate sitemaps per locale).
- If variants are device-specific but equivalent, pick a canonical approach and avoid listing both unless your setup explicitly requires it.
lastmod, changefreq, priority: what actually helps in 2026
The sitemap protocol supports optional metadata, but search engines treat most of it as “hints” at best.
lastmod (use only if you can be honest)
lastmod is worth using if:
- It represents the last significant update (content, structured data, meaningful links), and
- It’s consistently accurate over time.
If you update lastmod every time a deploy runs (even when the page didn’t change), you teach crawlers that your lastmod is noise. When that happens, engines ignore it.
Practical definition of “significant update”:
- Changes to the main content that affect meaning
- Changes to structured data that affect entities
- Changes to internal links that affect navigation and discovery
Not significant:
- Updating a footer year
- Minor formatting tweaks
- Rebuilding the site without content changes
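One way to keep `lastmod` honest is to fingerprint only the significant parts of a page and bump the date only when the fingerprint changes. A sketch under assumed inputs (the `PageSnapshot` fields are placeholders for whatever your build already knows about a page):

```typescript
import { createHash } from "node:crypto";

// Only the parts whose changes count as "significant" feed the hash,
// so deploys and footer tweaks don't produce a new lastmod.
interface PageSnapshot {
  mainContent: string;
  structuredData: string;
  internalLinks: string[];
}

function contentFingerprint(page: PageSnapshot): string {
  return createHash("sha256")
    .update(page.mainContent)
    .update(page.structuredData)
    .update(page.internalLinks.join("\n"))
    .digest("hex");
}

function nextLastmod(
  prevFingerprint: string,
  prevLastmod: string,
  page: PageSnapshot,
  today: string
): { fingerprint: string; lastmod: string } {
  const fingerprint = contentFingerprint(page);
  return {
    fingerprint,
    lastmod: fingerprint === prevFingerprint ? prevLastmod : today,
  };
}
```

This requires persisting the fingerprints between builds, but that small amount of state is what makes the dates trustworthy.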
changefreq and priority
Modern search engines largely treat these as non-actionable. Use them only if your tooling generates them automatically and you’re confident you’re not introducing misleading signals. Otherwise, skip them.
Where to host the sitemap (and why root is easiest)
Host the sitemap at the site root whenever possible:
- `/sitemap-index.xml` (index)
- `/sitemap-0.xml` (actual sitemap file)
Root hosting keeps scoping simple: under the sitemap protocol, a sitemap file can only list URLs at or below its own directory path, so hosting at the root lets one file cover the whole site. It also makes the location obvious in robots.txt.
Submission strategy: don’t “spray and pray”
In 2026, submission is still mostly a hint. The goal is reliability and observability.
Submit in webmaster tools
Benefits:
- You get processing status and errors.
- You can see discovery counts and indexing gaps.
Reference in robots.txt
This is the lowest-maintenance approach and helps crawlers find the sitemap automatically.
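The directive itself is one line; assuming the index lives at the root as above (the domain is a placeholder):

```text
Sitemap: https://example.com/sitemap-index.xml
```

Note that the `Sitemap:` directive takes an absolute URL and can appear anywhere in robots.txt, outside any user-agent group.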
In practice, do both:
- Keep the `robots.txt` sitemap directive.
- Also submit via webmaster tooling so you can observe and debug.
How to keep sitemaps fresh as content grows
Sitemap strategy fails when sitemaps become “set and forget”.
The automation loop
- Generate sitemaps automatically from your canonical routes.
- Validate output on every build (XML well-formed, URL count, size).
- Monitor processing errors and coverage mismatches.
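The validation step of the loop can be a small build-time check against the protocol limits. A sketch; the function name and error strings are illustrative, and a real pipeline would add a proper XML parse:

```typescript
// Cheap sitemap checks, intended to run on every build.
function validateSitemap(xml: string, origin: string): string[] {
  const errors: string[] = [];

  // Collect every <loc> entry.
  const urls: string[] = [];
  const locRe = /<loc>([^<]+)<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = locRe.exec(xml)) !== null) urls.push(m[1]);

  if (urls.length > 50_000) errors.push(`too many URLs: ${urls.length}`);
  if (Buffer.byteLength(xml, "utf8") > 50 * 1024 * 1024) {
    errors.push("sitemap exceeds 50MB uncompressed");
  }
  for (const url of urls) {
    if (!url.startsWith(origin)) errors.push(`wrong host: ${url}`);
  }
  if (!xml.includes("<urlset")) errors.push("missing <urlset> root element");
  return errors;
}
```

Failing the build on a non-empty error list is what keeps “set and forget” from happening silently.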
Update frequency
For a content site:
- Regenerate at build time (static) or on publish events (CMS).
- Avoid “daily regenerate everything” if content didn’t change and your tooling can’t keep `lastmod` honest.
Practical implementation in Astro (static-first)
For an Astro static site, the easiest and most reliable approach is to generate sitemaps during build. That ensures:
- The sitemap matches deployed routes.
- The sitemap stays deterministic.
Checklist for Astro sitemap correctness:
- `site` is set correctly (so absolute URLs are correct).
- Blog posts use canonical slugs consistently.
- Any non-indexable routes are excluded (private, internal tools, draft-like pages).
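The checklist maps onto a short config. A sketch assuming the official `@astrojs/sitemap` integration; the domain and the filtered paths are placeholders for your own routes:

```ts
// astro.config.ts
import { defineConfig } from "astro/config";
import sitemap from "@astrojs/sitemap";

export default defineConfig({
  site: "https://example.com", // required so generated URLs are absolute
  integrations: [
    sitemap({
      // Exclude non-indexable routes from the generated sitemap.
      filter: (page) => !page.includes("/drafts/") && !page.includes("/internal/"),
    }),
  ],
});
```

Because the integration runs at build time, the generated sitemap can only ever contain routes that actually deployed, which is exactly the determinism you want.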
Troubleshooting checklist (when indexing is “weird”)
When pages aren’t showing up in search, the sitemap is a diagnostic tool, not a magic fix. Debug with this order:
- Is the page indexable? (no `noindex`, not blocked by robots, returns 200)
- Is it canonical? (canonical tag + internal links + sitemap all agree)
- Is it discoverable? (internal links exist; sitemap lists it)
- Is it valuable? (thin content, near-duplicates, or low-utility pages often don’t index)
- Are there processing errors? (malformed XML, invalid URLs, wrong host)
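The debug order above can be encoded as a triage function that reports the first failing check, so you fix problems in the order that matters. The `UrlDiagnostics` fields are assumed inputs from your own checks, not a real API:

```typescript
// One field per check in the debug order; populate these however you gather them.
interface UrlDiagnostics {
  status: number;
  noindex: boolean;
  blockedByRobots: boolean;
  canonicalMatches: boolean; // canonical tag, internal links, and sitemap agree
  hasInternalLinks: boolean;
  inSitemap: boolean;
}

function firstProblem(d: UrlDiagnostics): string | null {
  if (d.status !== 200) return `not indexable: returns ${d.status}`;
  if (d.noindex) return "not indexable: noindex";
  if (d.blockedByRobots) return "not indexable: blocked by robots";
  if (!d.canonicalMatches) return "canonical signals disagree";
  if (!d.hasInternalLinks) return "not discoverable: no internal links";
  if (!d.inSitemap) return "not discoverable: missing from sitemap";
  return null; // healthy so far; check content quality and processing errors next
}
```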
Common sitemap mistakes that look like “SEO problems”:
- Listing URLs that redirect
- Listing parameterized URLs (duplicates)
- Including non-canonical variants
- Inflating `lastmod` for everything
- Including pages blocked by robots or marked `noindex`
Implementation checklist
- Sitemap contains only canonical, indexable URLs
- No `noindex` pages included
- No redirects included
- No internal search results included
- Sitemap stays under 50,000 URLs and 50MB uncompressed
- Sitemap index used when split is required
- `lastmod` used only when accurate
- `robots.txt` includes sitemap location
- Submitted to relevant webmaster tools for monitoring
FAQ
Should I split sitemaps?
Yes when you’re near the protocol limits or when you want operational clarity. Splitting by content type (blog, labs, projects) makes troubleshooting easier, and a sitemap index keeps discovery clean.
Do sitemaps improve rankings?
Not directly. They improve discovery and reduce crawl waste, which can indirectly improve indexing coverage and freshness on sites with lots of URLs.
Should I include archive pages like /blog/archive/2?
Include them only if they’re canonical and useful for users (not thin). Otherwise, let internal links handle discovery. For large content sites, a paginated archive is often worth indexing if it’s well-structured and not duplicative.
Should I use lastmod?
Only if it’s consistently accurate. If you can’t reliably compute “last significant content change,” omit it rather than generating noise.
What’s the simplest sitemap strategy that works?
Generate an XML sitemap during build, include canonical URLs only, keep it under limits, reference it from robots.txt, and monitor processing in webmaster tools.