Verify links when notebooks are changed (#426)

jhamon · web-flow · commit 04814efc60fd · 2025-03-24T13:58:21.000-04:00

## Problem

Some notebooks contain broken links.

## Solution

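When a notebook changes, CI should confirm that every link in it is
valid. The new `check-links` action extracts the URLs from each cell and
sends a `HEAD` request to each one, falling back to `GET` when the server
answers 405. The job fails if any link returns something other than 200.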
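
The extract-then-check flow described above can be sketched roughly as follows. This is a minimal illustration, not the script in this PR: it uses the standard library's `urllib` instead of `requests` (note that `urllib` follows redirects on its own), and the helper names `extract_links`, `head_status`, and `find_broken` are hypothetical. Only the URL pattern is copied from `check-links.py`.

```python
import re
import urllib.error
import urllib.request

# Same pattern as check-links.py: a URL ends at whitespace, a quote,
# an angle bracket, or a closing parenthesis.
URL_PATTERN = re.compile(r'https?://[^\s<>"\)]+|www\.[^\s<>"\)]+')


def extract_links(cell_sources):
    """Return the unique URLs found across an iterable of cell source strings."""
    links = set()
    for source in cell_sources:
        links.update(URL_PATTERN.findall(source))
    return links


def head_status(url, timeout=10):
    """Send a HEAD request and return the status code, or None on failure.

    Hypothetical helper; assumes network access when actually called.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses are raised as exceptions
    except (urllib.error.URLError, ValueError):
        return None  # unreachable host, or a URL without a scheme


def find_broken(links, status_of=head_status):
    """Return the links whose status is not 200, sorted for stable output."""
    return sorted(link for link in links if status_of(link) != 200)
```

Passing a stub as `status_of` makes it possible to exercise `find_broken` without touching the network, which is the reason the status lookup is a parameter in this sketch.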

## Type of Change

- [x] Infrastructure change (CI configs, etc)
diff --git a/.github/actions/check-links/action.yml b/.github/actions/check-links/action.yml
@@ -0,0 +1,27 @@
+name: "Check Links in Notebook"
+description: "Check Links in Notebook"
+
+inputs:
+  notebook:
+    description: "The notebook to check for broken links"
+    required: true
+
+runs:
+  using: 'composite'
+  steps:
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: '3.11'
+
+    - name: Install dependencies
+      shell: bash
+      run: |
+        pip install --upgrade pip
+        pip install nbformat requests
+
+    - id: check-links
+      shell: bash
+      name: Check links in notebook
+      run: |
+        python .github/actions/check-links/check-links.py "${{ inputs.notebook }}"
diff --git a/.github/actions/check-links/check-links.py b/.github/actions/check-links/check-links.py
@@ -0,0 +1,89 @@
+#! /usr/bin/env python
+
+# Check links in a notebook
+
+import re
+import os
+import sys
+import json
+import nbformat
+import requests
+
+# Get the notebook filename from the command line
+filename = "../../../" + sys.argv[1]
+print(f"Processing notebook: {filename}")
+nb_source_path = os.path.join(os.path.dirname(__file__), filename)
+
+known_good = [
+    "https://www.pinecone.io",
+    "https://app.pinecone.io",
+]
+known_good_links = set(known_good)
+for link in known_good:
+    known_good_links.add(f"{link}/")
+
+# Read the notebook
+with open(nb_source_path, "r", encoding="utf-8") as f:
+    nb = nbformat.read(f, as_version=4)
+
+    try:
+        good_links = set()
+        failed_links = set()
+        links = set()  # Use set to avoid duplicates
+        
+        # URL regex pattern - updated to handle markdown links better
+        url_pattern = r'https?://[^\s<>"\)]+|www\.[^\s<>"\)]+'
+        
+        # Search through all cells
+        for cell in nb['cells']:
+            if 'source' in cell:
+                # Join multi-line source into single string
+                content = ''.join(cell['source'])
+                # Find all URLs
+                found_links = re.findall(url_pattern, content)
+                links.update(found_links)
+        
+        if links:
+            print(f"\nFile: {filename}")
+            for link in sorted(links):
+                if link in known_good_links:
+                    good_links.add(link)
+                    print(f"  ✅ {link}")
+                    continue
+                elif link in good_links:
+                    continue
+                elif link in failed_links:
+                    continue
+                else:
+                    try:
+                        response = requests.head(link, timeout=10)
+                        if response.status_code == 405:
+                            # Not all links can be checked with HEAD, so we fall back to GET
+                            response = requests.get(link, timeout=10)
+                        
+                        if response.status_code == 200:
+                            good_links.add(link)
+                            print(f"  ✅ {link}")
+                        else:
+                            failed_links.add(link)
+                            print(f"  ❌ {response.status_code} {link}")
+                    except Exception as e:
+                        failed_links.add(link)
+                        print(f"  ❌ {link} ({e})")
+
+        print(f"Found {len(links)} links")
+        print(f"Good links: {len(good_links)}")
+        
+        if len(failed_links) > 0:
+            print("Failed links:")
+            for link in sorted(failed_links):
+                print(f"  ❌ {link}")
+            sys.exit(1)
+        else:
+            print("No bad links found")
+            sys.exit(0)
+
+    except json.JSONDecodeError:
+        sys.exit(f"Error: Could not parse {filename} as JSON")
+    except KeyError:
+        sys.exit(f"Error: Unexpected notebook format in {filename}")
diff --git a/.github/workflows/test-notebooks-changed.yaml b/.github/workflows/test-notebooks-changed.yaml
@@ -60,3 +60,18 @@ jobs:
           notebook: ${{ matrix.notebook }}
           PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+
+  check-links:
+    needs:
+      - validate-notebooks
+      - detect-changes
+    if: needs.detect-changes.outputs.has_changes == 'true'
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJSON(needs.detect-changes.outputs.matrix) }}
+    steps:
+      - uses: actions/checkout@v4
+      - uses: ./.github/actions/check-links
+        with:
+          notebook: ${{ matrix.notebook }}
diff --git a/docs/langchain-retrieval-agent.ipynb b/docs/langchain-retrieval-agent.ipynb
@@ -17,21 +17,28 @@
     "id": "bhWwrfbbVGOA"
    },
    "source": [
-    "#### [LangChain Handbook](https://pinecone.io/learn/langchain)\n",
+    "#### [LangChain Handbook](https://www.pinecone.io/learn/series/langchain/)\n",
     "\n",
     "# Retrieval Agents\n",
     "\n",
-    "We've seen in previous chapters how powerful [retrieval augmentation](https://www.pinecone.io/learn/langchain-retrieval-augmentation/) and [conversational agents](https://www.pinecone.io/learn/langchain-agents/) can be. They become even more impressive when we begin using them together.\n",
+    "We've seen in previous chapters how powerful [retrieval augmentation](https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/) and [conversational agents](https://www.pinecone.io/learn/series/langchain/langchain-agents/) can be. They become even more impressive when we begin using them together.\n",
     "\n",
     "Conversational agents can struggle with data freshness, knowledge about specific domains, or accessing internal documentation. By coupling agents with retrieval augmentation tools we no longer have these problems.\n",
     "\n",
     "One the other side, using \"naive\" retrieval augmentation without the use of an agent means we will retrieve contexts with *every* query. Again, this isn't always ideal as not every query requires access to external knowledge.\n",
     "\n",
     "Merging these methods gives us the best of both worlds. In this notebook we'll learn how to do this.\n",
     "\n",
-    "[![Open full notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb)\n",
+    "[![Open full notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Prerequisites\n",
     "\n",
-    "To begin, we must install the prerequisite libraries that we will be using in this notebook."
+    "To begin, we must install several libraries that we will be using in this notebook."
    ]
   },
   {
@@ -415,7 +422,7 @@
     "- `name` can be anything we like. The name is used as an identifier for the index when performing other operations such as `describe_index`, `delete_index`, and so on. \n",
     "- `metric` specifies the similarity metric that will be used later when you make queries to the index.\n",
     "- `dimension` should correspond to the dimension of the dense vectors produced by your embedding model. In this quick start, we are using made-up data so a small value is simplest.\n",
-    "- `spec` holds a specification which tells Pinecone how you would like to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).\n",
+    "- `spec` holds a specification which tells Pinecone how you would like to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/troubleshooting/available-cloud-regions).\n",
     "\n",
     "There are more configurations available, but this minimal set will get us started."
    ]