Sitemap Strategy in 2026: What to Include, Exclude, and Why
Sitemaps are a crawl hint and an indexing map. A practical guide to building and maintaining sitemaps for content-heavy sites.
TL;DR
- A sitemap is a discovery + crawl hint, not a ranking hack.
- Include canonical, indexable URLs you actually want to show in search.
- Keep one sitemap under 50,000 URLs and 50MB (uncompressed); use a sitemap index for more.
- `lastmod` only helps if it’s consistently accurate; `priority`/`changefreq` are effectively ignored by modern engines.
- Treat sitemaps as infrastructure: generate them automatically, validate them, and monitor processing errors.
What a sitemap is (and what it isn’t)
A sitemap is a machine-readable list of URLs that helps search engines discover and re-discover content. In 2026, it’s best to think of it as:
- A crawl map: “Here are the pages I care about.”
- A change signal (sometimes): “These pages changed recently.” (Only true when `lastmod` is reliable.)
- A coverage diagnostic: a stable place to compare “what I want indexed” vs. “what got indexed”.
What it is not:
- Not a guarantee that every URL will be crawled.
- Not a guarantee that every URL will be indexed.
- Not a substitute for internal linking, canonicalization, or quality content.
If your site has more than a few dozen URLs, a sitemap is table-stakes. If your site has dynamic or faceted navigation, a sitemap becomes the guardrail that keeps crawlers focused on the right URLs.
Sitemap size limits (and why they matter)
Two constraints shape almost every sitemap strategy:
- Maximum 50,000 URLs per sitemap
- Maximum 50MB (uncompressed) per sitemap
If you exceed either limit, you split into multiple sitemaps and optionally add a sitemap index that lists them.
Why this matters:
- Large sites “accidentally” produce huge URL spaces (filters, sorts, pagination, campaigns).
- Crawlers have budgets. If you dump noise into your sitemap, you burn crawl budget on pages you don’t want indexed.
- Splitting by content type (blog, labs, projects) makes debugging easier because you can identify which slice is failing.
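The split-and-index step can be sketched in a few lines. This is a minimal illustration, not a standard API: the function names and the `sitemap-N.xml` naming scheme are assumptions you would adapt to your own build.

```typescript
// Split a URL list across the 50,000-URL protocol limit and emit a sitemap index.
const MAX_URLS_PER_SITEMAP = 50_000;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

function buildSitemapIndex(origin: string, urls: string[]): string {
  const files = chunk(urls, MAX_URLS_PER_SITEMAP);
  const entries = files
    .map((_, i) => `  <sitemap><loc>${origin}/sitemap-${i}.xml</loc></sitemap>`)
    .join("\n");
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">`,
    entries,
    `</sitemapindex>`,
  ].join("\n");
}
```

Splitting by content type instead of by raw count works the same way: one `chunk` call per content slice, one index listing all the resulting files.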
What to include vs exclude (the decision table)
Here’s the simplest “include/exclude” table that prevents most sitemap mistakes:
| URL type | Include in sitemap? | Why |
|---|---|---|
| Canonical content pages | Yes | These are the pages you want to rank and maintain. |
| Pages with `noindex` | No | Mixed signals waste crawl budget and create processing noise. |
| Duplicate URLs (same content) | No | Canonicals should win; duplicates dilute crawl focus. |
| Internal search results | No | Often low-quality, infinite space, and can look spammy. |
| Faceted / filtered category pages | Usually no | Only include if curated and canonical (e.g., “/blog/tag/agents” as a real hub). |
| Paginated listing pages | Sometimes | Include only if they’re canonical and valuable (e.g., archive pages), otherwise let internal links handle discovery. |
| Login / account / private pages | No | Not indexable content. |
| Redirecting URLs | No | Engines want final destinations, not hops. |
| Parameterized tracking URLs (`?utm=` etc.) | No | Canonicalize and exclude to prevent duplication. |
Two rules that rarely fail:
- Don’t include anything you wouldn’t want as a search result.
- Only include URLs that resolve to a stable, canonical 200 page.
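The decision table above collapses into a single predicate. A sketch, assuming you already have crawl data for each page (the `PageInfo` fields and the `/search` path are illustrative, not a real API):

```typescript
// Inputs assumed to come from your own crawl or build data.
interface PageInfo {
  url: string;       // final URL after following any redirects
  status: number;    // HTTP status of the final response
  noindex: boolean;  // robots meta tag or X-Robots-Tag header
  canonical: string; // canonical URL declared by the page
}

function belongsInSitemap(page: PageInfo): boolean {
  const u = new URL(page.url);
  if (page.status !== 200) return false;              // no redirects or errors
  if (page.noindex) return false;                     // mixed signals waste crawl budget
  if (page.canonical !== page.url) return false;      // canonical URLs only
  if (u.search !== "") return false;                  // no ?utm= or other parameterized URLs
  if (u.pathname.startsWith("/search")) return false; // no internal search results
  return true;
}
```

Running every candidate URL through a predicate like this at build time is what turns the table from advice into a guardrail.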
Canonicalization: the sitemap “truth layer”
Search engines typically show canonical URLs in results. Your sitemap should reinforce that reality.
Canonical URLs only
If a page is accessible through multiple URLs (common with query params, alternative paths, or trailing-slash differences), choose one and make it canonical:
- Canonical tags on the page
- Internal links pointing to the canonical
- Sitemap including only the canonical
If you list both the canonical and the duplicate in the sitemap, you’re telling crawlers: “I’m not sure which one matters.” That uncertainty shows up as slower indexing and noisier coverage reports.
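One way to enforce “canonical URLs only” is to normalize every URL through one function before it reaches the sitemap. A sketch, assuming a specific set of conventions (lowercase host, tracking parameters stripped, no trailing slash except the root); the parameter list is illustrative:

```typescript
// Normalize a URL to one canonical form before it enters the sitemap.
function canonicalize(raw: string): string {
  const u = new URL(raw);
  u.hash = "";
  // Strip common tracking parameters (illustrative list).
  for (const key of Array.from(u.searchParams.keys())) {
    if (key.startsWith("utm_") || key === "gclid" || key === "fbclid") {
      u.searchParams.delete(key);
    }
  }
  // Trailing-slash policy: none, except for the root path.
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}
```

The exact policy matters less than applying the same one everywhere: canonical tags, internal links, and the sitemap should all emit the output of the same normalizer.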
Multi-variant pages (mobile/desktop, language versions)
If you have multiple variants, your sitemap strategy should reflect your canonical plan:
- If variants are truly separate localized pages, use proper localization annotations (and consider separate sitemaps per locale).
- If variants are device-specific but equivalent, pick a canonical approach and avoid listing both unless your setup explicitly requires it.
lastmod, changefreq, priority: what actually helps in 2026
The sitemap protocol supports optional metadata, but search engines treat most of it as “hints” at best.
lastmod (use only if you can be honest)
lastmod is worth using if:
- It represents the last significant update (content, structured data, meaningful links), and
- It’s consistently accurate over time.
If you update lastmod every time a deploy runs (even when the page didn’t change), you teach crawlers that your lastmod is noise. When that happens, engines ignore it.
Practical definition of “significant update”:
- Changes to the main content that affect meaning
- Changes to structured data that affect entities
- Changes to internal links that affect navigation and discovery
Not significant:
- Updating a footer year
- Minor formatting tweaks
- Rebuilding the site without content changes
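One way to keep `lastmod` honest is to fingerprint only the significant parts of a page and bump the date only when the fingerprint changes. A sketch under assumed inputs (the `PageSnapshot` fields are placeholders for whatever your build already knows about a page):

```typescript
import { createHash } from "node:crypto";

// Only the parts whose changes count as "significant" feed the hash,
// so deploys and footer tweaks don't produce a new lastmod.
interface PageSnapshot {
  mainContent: string;
  structuredData: string;
  internalLinks: string[];
}

function contentFingerprint(page: PageSnapshot): string {
  return createHash("sha256")
    .update(page.mainContent)
    .update(page.structuredData)
    .update(page.internalLinks.join("\n"))
    .digest("hex");
}

function nextLastmod(
  prevFingerprint: string,
  prevLastmod: string,
  page: PageSnapshot,
  today: string
): { fingerprint: string; lastmod: string } {
  const fingerprint = contentFingerprint(page);
  return {
    fingerprint,
    lastmod: fingerprint === prevFingerprint ? prevLastmod : today,
  };
}
```

This requires persisting the fingerprints between builds, but that small amount of state is what makes the dates trustworthy.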
changefreq and priority
Modern search engines largely treat these as non-actionable. Use them only if your tooling generates them automatically and you’re confident you’re not introducing misleading signals. Otherwise, skip them.
Where to host the sitemap (and why root is easiest)
Host the sitemap at the site root whenever possible:
- `/sitemap-index.xml` (index)
- `/sitemap-0.xml` (actual sitemap file)
Root hosting keeps scoping simple: under the sitemap protocol, a sitemap file can only list URLs at or below its own directory path, so hosting at the root lets one file cover the whole site. It also makes the location obvious in robots.txt.
Submission strategy: don’t “spray and pray”
In 2026, submission is still mostly a hint. The goal is reliability and observability.
Submit in webmaster tools
Benefits:
- You get processing status and errors.
- You can see discovery counts and indexing gaps.
Reference in robots.txt
This is the lowest-maintenance approach and helps crawlers find the sitemap automatically.
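The directive itself is one line; assuming the index lives at the root as above (the domain is a placeholder):

```text
Sitemap: https://example.com/sitemap-index.xml
```

Note that the `Sitemap:` directive takes an absolute URL and can appear anywhere in robots.txt, outside any user-agent group.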
In practice, do both:
- Keep the `robots.txt` sitemap directive.
- Also submit via webmaster tooling so you can observe and debug.
How to keep sitemaps fresh as content grows
Sitemap strategy fails when sitemaps become “set and forget”.
The automation loop
- Generate sitemaps automatically from your canonical routes.
- Validate output on every build (XML well-formed, URL count, size).
- Monitor processing errors and coverage mismatches.
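The validation step of the loop can be a small build-time check against the protocol limits. A sketch; the function name and error strings are illustrative, and a real pipeline would add a proper XML parse:

```typescript
// Cheap sitemap checks, intended to run on every build.
function validateSitemap(xml: string, origin: string): string[] {
  const errors: string[] = [];

  // Collect every <loc> entry.
  const urls: string[] = [];
  const locRe = /<loc>([^<]+)<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = locRe.exec(xml)) !== null) urls.push(m[1]);

  if (urls.length > 50_000) errors.push(`too many URLs: ${urls.length}`);
  if (Buffer.byteLength(xml, "utf8") > 50 * 1024 * 1024) {
    errors.push("sitemap exceeds 50MB uncompressed");
  }
  for (const url of urls) {
    if (!url.startsWith(origin)) errors.push(`wrong host: ${url}`);
  }
  if (!xml.includes("<urlset")) errors.push("missing <urlset> root element");
  return errors;
}
```

Failing the build on a non-empty error list is what keeps “set and forget” from happening silently.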
Update frequency
For a content site:
- Regenerate at build time (static) or on publish events (CMS).
- Avoid “daily regenerate everything” if content didn’t change and your tooling can’t keep `lastmod` honest.
Practical implementation in Astro (static-first)
For an Astro static site, the easiest and most reliable approach is to generate sitemaps during build. That ensures:
- The sitemap matches deployed routes.
- The sitemap stays deterministic.
Checklist for Astro sitemap correctness:
- `site` is set correctly (so absolute URLs are correct).
- Blog posts use canonical slugs consistently.
- Any non-indexable routes are excluded (private, internal tools, draft-like pages).
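The checklist maps onto a short config. A sketch assuming the official `@astrojs/sitemap` integration; the domain and the filtered paths are placeholders for your own routes:

```ts
// astro.config.ts
import { defineConfig } from "astro/config";
import sitemap from "@astrojs/sitemap";

export default defineConfig({
  site: "https://example.com", // required so generated URLs are absolute
  integrations: [
    sitemap({
      // Exclude non-indexable routes from the generated sitemap.
      filter: (page) => !page.includes("/drafts/") && !page.includes("/internal/"),
    }),
  ],
});
```

Because the integration runs at build time, the generated sitemap can only ever contain routes that actually deployed, which is exactly the determinism you want.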
Troubleshooting checklist (when indexing is “weird”)
When pages aren’t showing up in search, the sitemap is a diagnostic tool, not a magic fix. Debug with this order:
- Is the page indexable? (no `noindex`, not blocked by robots, returns 200)
- Is it canonical? (canonical tag + internal links + sitemap all agree)
- Is it discoverable? (internal links exist; sitemap lists it)
- Is it valuable? (thin content, near-duplicates, or low-utility pages often don’t index)
- Are there processing errors? (malformed XML, invalid URLs, wrong host)
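The debug order above can be encoded as a triage function that reports the first failing check, so you fix problems in the order that matters. The `UrlDiagnostics` fields are assumed inputs from your own checks, not a real API:

```typescript
// One field per check in the debug order; populate these however you gather them.
interface UrlDiagnostics {
  status: number;
  noindex: boolean;
  blockedByRobots: boolean;
  canonicalMatches: boolean; // canonical tag, internal links, and sitemap agree
  hasInternalLinks: boolean;
  inSitemap: boolean;
}

function firstProblem(d: UrlDiagnostics): string | null {
  if (d.status !== 200) return `not indexable: returns ${d.status}`;
  if (d.noindex) return "not indexable: noindex";
  if (d.blockedByRobots) return "not indexable: blocked by robots";
  if (!d.canonicalMatches) return "canonical signals disagree";
  if (!d.hasInternalLinks) return "not discoverable: no internal links";
  if (!d.inSitemap) return "not discoverable: missing from sitemap";
  return null; // healthy so far; check content quality and processing errors next
}
```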
Common sitemap mistakes that look like “SEO problems”:
- Listing URLs that redirect
- Listing parameterized URLs (duplicates)
- Including non-canonical variants
- Inflating `lastmod` for everything
- Including pages blocked by robots or marked `noindex`
Implementation checklist
- Sitemap contains only canonical, indexable URLs
- No `noindex` pages included
- No redirects included
- No internal search results included
- Sitemap stays under 50,000 URLs and 50MB uncompressed
- Sitemap index used when split is required
- `lastmod` used only when accurate
- `robots.txt` includes sitemap location
- Submitted to relevant webmaster tools for monitoring
FAQ
Should I split sitemaps?
Yes when you’re near the protocol limits or when you want operational clarity. Splitting by content type (blog, labs, projects) makes troubleshooting easier, and a sitemap index keeps discovery clean.
Do sitemaps improve rankings?
Not directly. They improve discovery and reduce crawl waste, which can indirectly improve indexing coverage and freshness on sites with lots of URLs.
Should I include archive pages like /blog/archive/2?
Include them only if they’re canonical and useful for users (not thin). Otherwise, let internal links handle discovery. For large content sites, a paginated archive is often worth indexing if it’s well-structured and not duplicative.
Should I use lastmod?
Only if it’s consistently accurate. If you can’t reliably compute “last significant content change,” omit it rather than generating noise.
What’s the simplest sitemap strategy that works?
Generate an XML sitemap during build, include canonical URLs only, keep it under limits, reference it from robots.txt, and monitor processing in webmaster tools.