Support per-language exclusions in sitemap.xml generation #7777

@bambuca

Summary

Currently, the RobotsTxtSettings.DisallowLanguages setting lets administrators block crawlers from specific language versions. However, this exclusion is applied only to robots.txt, not to sitemap.xml.

As a result, languages disallowed in robots.txt can still appear in sitemap entries (including <xhtml:link rel="alternate" hreflang="...">), which may send conflicting signals to crawlers.


💡 Real-world use case

Imagine an e-commerce store that:

  • Has a language version /de for testing or development purposes
  • Blocks /de in robots.txt to keep crawlers away from it
  • Still sees /de alternate URLs in sitemap.xml

This can cause:

  • Google Search Console warnings
  • Indexing of alternate versions that should not be public
  • Possible duplicate-content issues across language versions

✅ Recommendation

Introduce a new setting:

SitemapXmlSettings.DisallowLanguages

This would allow fine-grained control over which language versions are included in the sitemap, independently of what is excluded from robots.txt.

We should not reuse RobotsTxtSettings.DisallowLanguages, because:

  • It may be desirable to exclude a language from sitemap but still allow crawlers to access it (e.g., noindex SEO experiments)
  • The two mechanisms have separate behavior and timing: robots.txt controls crawler access, while sitemap.xml advertises URLs for discovery

📦 Affected areas

  • SitemapModelFactory
  • SitemapXmlSettings

💪 Next step

I'd be happy to prepare a PR that introduces:

  • A new setting DisallowLanguages in SitemapXmlSettings
  • Updates to SitemapModelFactory to respect this list when generating entries and alternate <xhtml:link> references
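
To illustrate the intended behavior, here is a minimal, language-neutral sketch in Python. It is not the project's actual code: `SitemapEntry`, `build_entries`, and the `/{code}{path}` URL scheme are hypothetical names and assumptions made for this example, and the real change would live in SitemapModelFactory. The point is only that a disallowed language is dropped both as a sitemap entry and as an alternate link.

```python
from dataclasses import dataclass, field


@dataclass
class SitemapEntry:
    """One <url> element: a location plus its hreflang alternates (hypothetical type)."""
    loc: str
    alternates: dict = field(default_factory=dict)  # hreflang code -> URL


def build_entries(pages, languages, disallowed):
    """Build sitemap entries, leaving out every language code in `disallowed`.

    pages:      URL paths without a language prefix, e.g. ["/", "/cart"]
    languages:  all published language codes, e.g. ["en", "de"]
    disallowed: codes to exclude (stands in for the proposed DisallowLanguages list)
    """
    blocked = {code.lower() for code in disallowed}
    allowed = [code for code in languages if code.lower() not in blocked]

    entries = []
    for path in pages:
        # Alternates are built from allowed languages only, so a blocked
        # language never appears as an <xhtml:link> reference either.
        alternates = {code: f"/{code}{path}" for code in allowed}
        for code in allowed:
            entries.append(SitemapEntry(loc=f"/{code}{path}", alternates=dict(alternates)))
    return entries


if __name__ == "__main__":
    for entry in build_entries(["/", "/cart"], ["en", "de"], ["de"]):
        print(entry.loc, entry.alternates)
```

Because the filter runs before both loops, the entry list and the alternate links cannot drift apart, which is the inconsistency described in the summary.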
