Summary
Currently, the RobotsTxtSettings.DisallowLanguages setting allows administrators to exclude specific languages from being indexed by robots. However, this exclusion is only applied in robots.txt, not in sitemap.xml.
As a result, languages disallowed in robots.txt can still appear in sitemap entries (including <xhtml:link rel="alternate" hreflang="...">), which may send conflicting signals to crawlers.
💡 Real-world use case
Imagine an e-commerce store that:
- Has a language version /de for testing or development purposes
- Blocks /de in robots.txt to prevent indexing
- Still sees /de alternate URLs in sitemap.xml
This can cause:
- Google Search Console warnings
- Indexing of alternate versions that should not be public
- Duplicate content penalties across language versions
✅ Recommendation
Introduce a new setting:
SitemapXmlSettings.DisallowLanguages
This would allow fine-grained control over which language versions are included in the sitemap, independently of what is excluded from robots.txt.
We should not reuse RobotsTxtSettings.DisallowLanguages, because:
- It may be desirable to exclude a language from the sitemap but still allow crawlers to access it (e.g., noindex SEO experiments)
- The two systems have separate behavior and timing
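To illustrate why the two lists should stay separate, here is a minimal, language-neutral sketch in Python (not the project's C# code). The class and field names mirror the settings discussed above, but their shape here is an assumption, and `language_visibility` is a hypothetical helper added only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RobotsTxtSettings:
    # Language IDs that receive a Disallow rule in robots.txt (assumed shape).
    disallow_languages: List[int] = field(default_factory=list)


@dataclass
class SitemapXmlSettings:
    # Proposed setting: language IDs left out of sitemap.xml (assumed shape).
    disallow_languages: List[int] = field(default_factory=list)


def language_visibility(language_id: int,
                        robots: RobotsTxtSettings,
                        sitemap: SitemapXmlSettings) -> Dict[str, bool]:
    """Hypothetical helper: report how one language is exposed to crawlers."""
    return {
        "crawlable": language_id not in robots.disallow_languages,
        "in_sitemap": language_id not in sitemap.disallow_languages,
    }


# Language 2 is a noindex SEO experiment: crawlers may fetch it,
# but it should not be advertised in the sitemap.
robots = RobotsTxtSettings(disallow_languages=[])
sitemap = SitemapXmlSettings(disallow_languages=[2])
experiment = language_visibility(2, robots, sitemap)
```

With a single shared list, the combination above (crawlable but absent from the sitemap) could not be expressed.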
📦 Affected areas
SitemapModelFactory
SitemapXmlSettings
💪 Next step
I'd be happy to prepare a PR that introduces:
- A new setting DisallowLanguages in SitemapXmlSettings
- Updates to SitemapModelFactory to respect this list when generating entries and alternate <xhtml:link> references
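The intended filtering could look roughly like the following sketch. It is written in Python for brevity and is not the actual SitemapModelFactory code: `build_sitemap`, the `(id, code)` language tuples, the URL layout, and the `https://example.com` base URL are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
BASE_URL = "https://example.com"  # placeholder store URL (assumption)


def build_sitemap(slugs, languages, disallowed_language_ids):
    """Build sitemap XML, skipping disallowed languages everywhere.

    slugs: page slugs, e.g. ["shoes"]
    languages: (language_id, url_code) tuples, e.g. [(1, "en"), (2, "de")]
    disallowed_language_ids: IDs from the proposed DisallowLanguages list
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)

    # Filter once, then reuse the result for both <url> entries and alternates,
    # so a disallowed language cannot leak in through an hreflang link.
    allowed = [(lang_id, code) for lang_id, code in languages
               if lang_id not in disallowed_language_ids]

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for slug in slugs:
        for _, code in allowed:
            url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
            loc.text = f"{BASE_URL}/{code}/{slug}"
            for _, alt_code in allowed:
                ET.SubElement(url, f"{{{XHTML_NS}}}link", {
                    "rel": "alternate",
                    "hreflang": alt_code,
                    "href": f"{BASE_URL}/{alt_code}/{slug}",
                })
    return ET.tostring(urlset, encoding="unicode")


languages = [(1, "en"), (2, "de")]
filtered_xml = build_sitemap(["shoes"], languages, {2})
unfiltered_xml = build_sitemap(["shoes"], languages, set())
```

The point of the sketch is that the same filtered list drives both the primary entries and the alternate links; filtering only the entries would still leave /de reachable through hreflang references.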