fix(docs-mcp): recursively crawl and register nested llms.txt resources#2317
🦋 Changeset detected. Latest commit: 6877936. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough: Converts the docs MCP server resource registrar into an async crawler that recursively discovers and registers nested llms.txt resources.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
packages/mcp-servers/docs-mcp-server/main.ts (1)
122-130: Consider making the resource factory consistent with non-nested resources.
The factory here returns the cached `nestedMarkdown` captured at registration time, while regular resources (lines 157-171) use an async factory that fetches fresh content on each read. This creates a behavioral inconsistency: nested index resources return startup-time content, while other resources reflect current server content.
If this caching is intentional (avoiding redundant fetches for stable index files), consider adding a brief comment to document the design decision.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/mcp-servers/docs-mcp-server/main.ts` around lines 122 - 130, The resource factory currently returns the cached nestedMarkdown captured at registration (the factory returning () => ({ contents: [{ uri: `lynx-docs://${strippedUrl}`, text: nestedMarkdown, mimeType: 'text/markdown' }] })), which is inconsistent with the other resource factories that are async and fetch fresh content on each read; either change this factory to an async factory that computes/fetches the current nested markdown on each invocation (e.g., async () => ({ contents: [{ uri: `lynx-docs://${strippedUrl}`, text: await computeNestedMarkdown(...), mimeType: 'text/markdown' }] })) to match the behavior of the resources at lines 157-171, or if startup caching is intentional, add a short comment above this factory referencing nestedMarkdown and explaining that it is intentionally captured at registration to avoid repeated fetches.
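The two behaviors this comment contrasts can be sketched as follows. This is a minimal illustration, not the server's actual code: `ResourceContents`, `TextFetcher`, `makeSnapshotFactory`, and `makeFreshFactory` are hypothetical names, and the fetcher is injected so the sketch runs without network access or the MCP SDK.

```typescript
// Hypothetical shapes for illustration; these are not the MCP SDK's actual types.
interface ResourceContents {
  contents: { uri: string; text: string; mimeType: string }[];
}
type TextFetcher = (url: string) => Promise<string>;

// Option A: snapshot factory. The markdown is captured once, at registration time.
function makeSnapshotFactory(uri: string, markdown: string): () => ResourceContents {
  // Intentionally cached: index files are assumed stable for the process lifetime.
  return () => ({ contents: [{ uri, text: markdown, mimeType: 'text/markdown' }] });
}

// Option B: fresh factory. The content is fetched again on every read.
function makeFreshFactory(
  uri: string,
  url: string,
  fetchText: TextFetcher,
): () => Promise<ResourceContents> {
  return async () => ({
    contents: [{ uri, text: await fetchText(url), mimeType: 'text/markdown' }],
  });
}
```

Option A trades freshness for fewer requests; if it is the intended behavior, the comment inside `makeSnapshotFactory` shows the kind of one-line note the reviewer asks for.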
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e04dd5dd-f09e-45c9-80f1-5621108d9111
📒 Files selected for processing (2): .changeset/fix-recursive-docs-mcp.md, packages/mcp-servers/docs-mcp-server/main.ts
Pull request overview
This PR updates the docs MCP server to recursively discover and register resources referenced by nested llms.txt index files, preventing “Resource not found” errors when documentation is organized under sub-indexes.
Changes:
- Implement recursive crawling of llms.txt links to register nested indexes and their referenced resources.
- Add HTTP status handling for nested index fetches and for resource fetches during reads.
- Add a changeset to publish a patch release for @lynx-js/docs-mcp-server.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/mcp-servers/docs-mcp-server/main.ts | Adds recursive crawling/registration of nested llms.txt resources and improves fetch error handling. |
| .changeset/fix-recursive-docs-mcp.md | Patch changeset entry for the docs MCP server. |
if (strippedUrl.endsWith('llms.txt')) {
  if (visited.has(strippedUrl)) {
    continue;
  }

  debug(`Recursively fetching index: ${link.url}`);
  try {
    const response = await fetch(link.url);
    if (!response.ok) {
      debug(`Failed to fetch nested index ${link.url}: ${response.status} ${response.statusText}`);
      continue;
    }
    const nestedMarkdown = await response.text();
    visited.add(strippedUrl);
In the recursive llms.txt branch, visited is only updated after a successful fetch+read. If the same nested index appears in multiple index files and consistently fails (404/network), the server will re-fetch it for every occurrence, potentially causing long startup times and noisy logs. Consider adding the strippedUrl to visited before attempting the fetch (or tracking a separate failed/inProgress set) so each nested index is attempted at most once per startup run.
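A minimal sketch of the suggested ordering, assuming a hypothetical helper `fetchIndexOnce` and an injected `IndexFetcher` (neither exists in the PR): the URL is added to `visited` before the fetch, so a failing index is attempted at most once per run.

```typescript
// Hypothetical fetcher type so the sketch runs without network access.
type IndexFetcher = (url: string) => Promise<{ ok: boolean; text: string }>;

// Returns the body of a nested index, or null if it was already attempted or failed.
async function fetchIndexOnce(
  url: string,
  visited: Set<string>,
  fetchIndex: IndexFetcher,
): Promise<string | null> {
  if (visited.has(url)) {
    return null; // already attempted during this startup run
  }
  // Mark before fetching so a 404 or network error is not retried.
  visited.add(url);
  try {
    const response = await fetchIndex(url);
    return response.ok ? response.text : null;
  } catch {
    return null; // network failure: the URL stays in `visited`
  }
}
```

With this ordering, an index that is linked from several parents and always fails costs one request instead of one per occurrence. A separate failed or in-progress set, as the comment also mentions, would be the alternative if successful and failed attempts need to be told apart later.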
async () => {
  const response = await fetch(link.url);
  if (!response.ok) {
    throw new Error(`Failed to fetch resource ${link.url}: ${response.status} ${response.statusText}`);
There’s an extra leading space before throw new Error(...) which makes the indentation inconsistent with the surrounding block and may fail formatting checks (e.g., Prettier). Align this line’s indentation with the rest of the block.
-     throw new Error(`Failed to fetch resource ${link.url}: ${response.status} ${response.statusText}`);
+    throw new Error(`Failed to fetch resource ${link.url}: ${response.status} ${response.statusText}`);
❌ 1 Tests Failed:
Web Explorer #8807 Bundle Size: 748.66KiB (0%). 6877936 (current) vs cffd86f main #8805 (baseline). Bundle metrics: current values match the baseline. Bundle size by type: no changes. Bundle analysis report generated by RelativeCI for branch fix/docs-mcp-recursion.
25e7b16 to 80a7953
80a7953 to 7b4ac27
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/mcp-servers/docs-mcp-server/main.ts`:
- Around line 49-54: The crawler in crawlAndRegisterResources and related blocks
resolves nested link.url against the original root baseURL instead of the
current llms.txt location, so relative links like ./foo.md or ../bar/llms.txt
break; fix by resolving each link against the current resource's URL before
using or recursing: compute a resolved URL using new URL(link.url,
currentResourceBase) where currentResourceBase is the URL of the llms.txt (or
the full URL you just fetched/parsed) rather than the passed-in root baseURL,
use that resolved URL for fetching/registering and pass its origin/path (or the
resolved URL) as the base for recursive calls (update occurrences in
crawlAndRegisterResources and the other blocks mentioned: 99-138, 157-171,
222-228).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2b2081ec-7d5c-4d04-8e48-15dd4bbf79a3
📒 Files selected for processing (2): .changeset/fix-recursive-docs-mcp.md, packages/mcp-servers/docs-mcp-server/main.ts
🚧 Files skipped from review as they are similar to previous changes (1): .changeset/fix-recursive-docs-mcp.md
async function crawlAndRegisterResources(
  baseURL: string,
  mcpServer: McpServer,
  fromMarkdownText: string,
  visited: Set<string> = new Set(),
) {
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
BASE_URL="${BASE_URL:-https://lynxjs.org/next}"
ROOT_URL="${BASE_URL%/}/llms.txt"
python - <<'PY' "$ROOT_URL"
import re
import sys
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

root_url = sys.argv[1]

def read(url: str) -> str:
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def markdown_links(md: str):
    return re.findall(r'\[[^\]]+\]\(([^)]+)\)', md)

root_md = read(root_url)
nested_indexes = []
for href in markdown_links(root_md):
    absolute = urljoin(root_url, href)
    if urlparse(absolute).path.endswith("llms.txt"):
        nested_indexes.append(absolute)

print(f"Root index: {root_url}")
print(f"Nested indexes found: {len(nested_indexes)}")
for nested_url in nested_indexes:
    nested_md = read(nested_url)
    relative_links = []
    for href in markdown_links(nested_md):
        parsed = urlparse(href)
        if not parsed.scheme and not href.startswith("//"):
            relative_links.append(href)
    print(f"\n{nested_url}")
    print(f"  relative links: {len(relative_links)}")
    for href in relative_links[:10]:
        print(f"    - {href}")
PY
Repository: lynx-family/lynx-stack
Length of output: 2682
🏁 Script executed:
head -c 0 /dev/null # test
wc -l packages/mcp-servers/docs-mcp-server/main.ts
Repository: lynx-family/lynx-stack
Length of output: 114
🏁 Script executed:
sed -n '49,54p' packages/mcp-servers/docs-mcp-server/main.ts
sed -n '99,138p' packages/mcp-servers/docs-mcp-server/main.ts
sed -n '157,171p' packages/mcp-servers/docs-mcp-server/main.ts
sed -n '222,228p' packages/mcp-servers/docs-mcp-server/main.ts
Repository: lynx-family/lynx-stack
Length of output: 1892
Resolve nested links against the current llms.txt, not the root URL.
After the first recursion level, the crawler still consumes link.url as-is and keeps passing baseUrl forward. If a nested index contains relative links (./foo.md, ../bar/llms.txt), those links will never resolve correctly, so the nested resources still won't be registered/read.
💡 Proposed fix
 async function crawlAndRegisterResources(
   baseURL: string,
+  currentIndexUrl: string,
   mcpServer: McpServer,
   fromMarkdownText: string,
   visited: Set<string> = new Set(),
 ) {
@@
-  const u = new URL(link.url);
+  const u = new URL(link.url, currentIndexUrl);
@@
-  const response = await fetch(link.url);
+  const nestedIndexUrl = new URL(link.url, currentIndexUrl);
+  const response = await fetch(nestedIndexUrl);
@@
   await crawlAndRegisterResources(
     baseURL,
+    nestedIndexUrl.href,
     mcpServer,
     nestedMarkdown,
     visited,
   );
@@
-  const response = await fetch(link.url);
+  const resourceUrl = new URL(link.url, currentIndexUrl);
+  const response = await fetch(resourceUrl);
   if (!response.ok) {
-    throw new Error(`Failed to fetch resource ${link.url}: ${response.status} ${response.statusText}`);
+    throw new Error(`Failed to fetch resource ${resourceUrl}: ${response.status} ${response.statusText}`);
   }
@@
 await crawlAndRegisterResources(
   baseUrl,
+  ROOT_DOC_URL,
   mcpServer,
   ROOT_DOC_MARKDOWN,
   visited,
 );

Also applies to: 99-138, 157-171, 222-228
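To illustrate why the base URL matters, here is a small sketch of how the standard `URL` constructor resolves a link against the index file that contains it. The helper `resolveLink` and the `example.com` URLs are illustrative assumptions, not names or addresses from the PR.

```typescript
// Resolve a link found in an index file against that index file's own URL.
function resolveLink(href: string, currentIndexUrl: string): string {
  return new URL(href, currentIndexUrl).href;
}

// Illustrative URLs only; they are not taken from the real documentation site.
const nestedIndex = 'https://example.com/docs/api/llms.txt';

console.log(resolveLink('./foo.md', nestedIndex));
// → https://example.com/docs/api/foo.md
console.log(resolveLink('../guide/llms.txt', nestedIndex));
// → https://example.com/docs/guide/llms.txt
console.log(resolveLink('https://example.com/other.md', nestedIndex));
// → https://example.com/other.md (absolute links are unaffected by the base)
```

Because absolute links pass through unchanged, resolving every link this way is safe even when an index mixes absolute and relative references.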
Merging this PR will degrade performance by 14.57%
React MTF Example #366 Bundle Size: 206.12KiB (0%). 6877936 (current) vs cffd86f main #364 (baseline). Bundle metrics: current values match the baseline. Bundle size by type: no changes.
React External #351 Bundle Size: 591.76KiB (0%). 6877936 (current) vs cffd86f main #349 (baseline). Bundle metrics: one percentage metric reads 0% (current) vs 30.88% (baseline); all other values match the baseline.
React Example #7233 Bundle Size: 236.83KiB (0%). 6877936 (current) vs cffd86f main #7231 (baseline). Bundle metrics: current values match the baseline. Bundle size by type: no changes.
This PR fixes an issue where nested documentation resources (linked from sub-indexes like api/llms.txt) were not being registered by the MCP server, causing 'Resource not found' errors.
Changes:
llms.txt files in main.ts.