Commit 04814ef

Verify links when notebooks are changed (#426)
## Problem

Some notebooks contain broken links.

## Solution

When someone is working on a notebook, we want to ensure all links are valid. Parse out the links, then do a `HEAD` request to see if they are valid.

## Type of Change

- [x] Infrastructure change (CI configs, etc)
1 parent a54a7c0 commit 04814ef

4 files changed: +143 -5 lines changed

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+name: "Check Links in Notebook"
+description: "Check Links in Notebook"
+
+inputs:
+  notebook:
+    description: "The notebook to check for broken links"
+    required: true
+
+runs:
+  using: 'composite'
+  steps:
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: '3.11'
+
+    - name: Install dependencies
+      shell: bash
+      run: |
+        pip install --upgrade pip
+        pip install nbformat requests
+
+    - id: convert
+      shell: bash
+      name: Check links in notebook
+      run: |
+        python .github/actions/check-links/check-links.py ${{ inputs.notebook }}
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+#! /usr/bin/env python
+
+# Check links in a notebook
+
+import re
+import os
+import sys
+import json
+import nbformat
+import requests
+
+# Get the notebook filename from the command line
+filename = "../../../" + sys.argv[1]
+print(f"Processing notebook: {filename}")
+nb_source_path = os.path.join(os.path.dirname(__file__), filename)
+
+known_good = [
+    "https://www.pinecone.io",
+    "https://app.pinecone.io",
+]
+known_good_links = set(known_good)
+for link in known_good:
+    known_good_links.add(f"{link}/")
+
+# Read the notebook
+with open(nb_source_path, "r", encoding="utf-8") as f:
+    nb = nbformat.read(f, as_version=4)
+
+try:
+    good_links = set()
+    failed_links = set()
+    links = set() # Use set to avoid duplicates
+
+    # URL regex pattern - updated to handle markdown links better
+    url_pattern = r'https?://[^\s<>"\)]+|www\.[^\s<>"\)]+'
+
+    # Search through all cells
+    for cell in nb['cells']:
+        if 'source' in cell:
+            # Join multi-line source into single string
+            content = ''.join(cell['source'])
+            # Find all URLs
+            found_links = re.findall(url_pattern, content)
+            links.update(found_links)
+
+    if links:
+        print(f"\nFile: {filename}")
+        for link in sorted(links):
+            if link in known_good_links:
+                good_links.add(link)
+                print(f" ✅ {link}")
+                continue
+            elif link in good_links:
+                continue
+            elif link in failed_links:
+                continue
+            else:
+                try:
+                    response = requests.head(link, timeout=10)
+                    if response.status_code == 405:
+                        # Not all links can be checked with HEAD, so we fall back to GET
+                        response = requests.get(link, timeout=10)
+
+                    if response.status_code == 200:
+                        good_links.add(link)
+                        print(f" ✅ {link}")
+                    else:
+                        failed_links.add(link)
+                        print(f" ❌ {response.status_code} {link}")
+                except Exception as e:
+                    failed_links.add(link)
+                    print(f" ❌ {link}")
+
+    print(f"Found {len(links)} links")
+    print(f"Good links: {len(good_links)}")
+
+    if len(failed_links) > 0:
+        print("Failed links:")
+        for link in sorted(failed_links):
+            print(f" ❌ {link}")
+        sys.exit(1)
+    else:
+        print("No bad links found")
+        sys.exit(0)
+
+except json.JSONDecodeError:
+    print(f"Error: Could not parse {filename} as JSON")
+except KeyError:
+    print(f"Error: Unexpected notebook format in {filename}")
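The link-extraction step of the script above can be sketched as a standalone, testable function. This is a minimal sketch, not the commit's code: it reuses the script's regex, but the `extract_links` name, the `TRAILING` character set, and the `https://` prefix for bare `www.` matches are assumptions added for illustration. The prefix is there because `requests` raises `MissingSchema` for a URL without a scheme, so a bare `www.` match would otherwise always be reported as failed.

```python
import re

# Same pattern as the script: http(s) URLs, or bare "www." hosts.
URL_PATTERN = re.compile(r'https?://[^\s<>"\)]+|www\.[^\s<>"\)]+')

# Characters that often trail a URL in prose or code but are rarely part of it.
# This set is an assumption of the sketch, not something the commit defines.
TRAILING = ".,;:!?'*]"


def extract_links(text: str) -> set[str]:
    """Return the normalized set of URLs found in one cell's source text."""
    links = set()
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(TRAILING)
        if url.startswith("www."):
            # Add a scheme so an HTTP client can request the URL at all.
            url = "https://" + url
        links.add(url)
    return links
```

Returning a set keeps the script's de-duplication behavior, and keeping the function free of I/O means it can be exercised without reading a notebook or touching the network.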

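The `HEAD`-then-`GET` check can also be isolated, which makes the fallback logic testable without network access. The sketch below is one possible shape under stated assumptions: `check_link`, `urllib_fetch`, and the `Fetch` type are hypothetical names, and it uses `urllib` from the standard library rather than `requests` so that it runs without extra dependencies. One behavioral difference to keep in mind: `urllib` follows redirects, whereas `requests.head` does not follow them by default, so the script in this commit reports a redirecting link by its 3xx status instead of its final destination.

```python
import urllib.error
import urllib.request
from typing import Callable

# (method, url) -> HTTP status code; raises OSError on network failure.
Fetch = Callable[[str, str], int]


def urllib_fetch(method: str, url: str, timeout: float = 10.0) -> int:
    """Issue one request and return its status code (redirects are followed)."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return err.code


def check_link(url: str, fetch: Fetch = urllib_fetch) -> tuple[bool, str]:
    """Return (ok, detail), trying HEAD first and GET only on a 405."""
    try:
        status = fetch("HEAD", url)
        if status == 405:
            # Some servers reject HEAD, so retry the same URL with GET.
            status = fetch("GET", url)
    except (OSError, ValueError) as err:
        # URLError subclasses OSError; ValueError covers malformed URLs.
        return False, f"{type(err).__name__}: {err}"
    return status == 200, str(status)
```

Passing `fetch` as a parameter lets a test substitute a fake, and returning the error text preserves detail that the script's bare `except Exception as e` discards.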
.github/workflows/test-notebooks-changed.yaml

Lines changed: 15 additions & 0 deletions
@@ -60,3 +60,18 @@ jobs:
           notebook: ${{ matrix.notebook }}
           PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+
+  check-links:
+    needs:
+      - validate-notebooks
+      - detect-changes
+    if: needs.detect-changes.outputs.has_changes == 'true'
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJSON(needs.detect-changes.outputs.matrix) }}
+    steps:
+      - uses: actions/checkout@v4
+      - uses: ./.github/actions/check-links
+        with:
+          notebook: ${{ matrix.notebook }}
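The `detect-changes` job is not part of this diff, so the exact shape of its outputs is unknown. The sketch below shows one shape that would satisfy the expressions used by `check-links`: a `has_changes` string compared against `'true'`, and a `matrix` JSON object with a `notebook` list, so that `matrix.notebook` resolves in each job. The `build_outputs` function and its argument are hypothetical and exist only for illustration.

```python
import json


def build_outputs(changed_notebooks: list[str]) -> dict[str, str]:
    """Build hypothetical job outputs that the check-links job could consume."""
    matrix = {"notebook": sorted(changed_notebooks)}
    return {
        # Compared against the string 'true' in the job's `if:` condition.
        "has_changes": "true" if changed_notebooks else "false",
        # Parsed by fromJSON(...); each list entry becomes one matrix job.
        "matrix": json.dumps(matrix),
    }
```

With this shape, the `if:` guard also keeps the job from being expanded over an empty `notebook` list when no notebooks have changed.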

docs/langchain-retrieval-agent.ipynb

Lines changed: 12 additions & 5 deletions
@@ -17,21 +17,28 @@
     "id": "bhWwrfbbVGOA"
    },
    "source": [
-    "#### [LangChain Handbook](https://pinecone.io/learn/langchain)\n",
+    "#### [LangChain Handbook](https://www.pinecone.io/learn/series/langchain/)\n",
     "\n",
     "# Retrieval Agents\n",
     "\n",
-    "We've seen in previous chapters how powerful [retrieval augmentation](https://www.pinecone.io/learn/langchain-retrieval-augmentation/) and [conversational agents](https://www.pinecone.io/learn/langchain-agents/) can be. They become even more impressive when we begin using them together.\n",
+    "We've seen in previous chapters how powerful [retrieval augmentation](https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/) and [conversational agents](https://www.pinecone.io/learn/series/langchain/langchain-agents/) can be. They become even more impressive when we begin using them together.\n",
     "\n",
     "Conversational agents can struggle with data freshness, knowledge about specific domains, or accessing internal documentation. By coupling agents with retrieval augmentation tools we no longer have these problems.\n",
     "\n",
     "One the other side, using \"naive\" retrieval augmentation without the use of an agent means we will retrieve contexts with *every* query. Again, this isn't always ideal as not every query requires access to external knowledge.\n",
     "\n",
     "Merging these methods gives us the best of both worlds. In this notebook we'll learn how to do this.\n",
     "\n",
-    "[![Open full notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb)\n",
+    "[![Open full notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Prerequisites\n",
     "\n",
-    "To begin, we must install the prerequisite libraries that we will be using in this notebook."
+    "To begin, we must install several libraries that we will be using in this notebook."
    ]
   },
   {
@@ -415,7 +422,7 @@
     "- `name` can be anything we like. The name is used as an identifier for the index when performing other operations such as `describe_index`, `delete_index`, and so on. \n",
     "- `metric` specifies the similarity metric that will be used later when you make queries to the index.\n",
     "- `dimension` should correspond to the dimension of the dense vectors produced by your embedding model. In this quick start, we are using made-up data so a small value is simplest.\n",
-    "- `spec` holds a specification which tells Pinecone how you would like to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).\n",
+    "- `spec` holds a specification which tells Pinecone how you would like to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/troubleshooting/available-cloud-regions).\n",
     "\n",
     "There are more configurations available, but this minimal set will get us started."
    ]
