Migrate Hybrid Search Labs Notebook from the `rank` parameter to the Retrievers API (Elasticsearch 9.x) (#459)

mridula-s109 · web-flow · commit e8f446eff62e · 2025-05-29T14:44:40.000+01:00
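This commit migrates the Hybrid Search Labs Notebook to the Retrievers API in Elasticsearch 9.x. RRF is still used, but it is now expressed as an `rrf` retriever that wraps a `standard` retriever and a `knn` retriever, replacing the previous top-level `query`, `knn`, and `rank={"rrf": {}}` parameters of `client.search`. `pretty_response` now reads `_score` instead of `_rank` and prints each hit's position in the results as its rank.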
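
For reviewers, the shape of the new request body can be sketched as below. This is only an illustration: `build_rrf_retriever` is a hypothetical helper that does not exist in the notebook, and the three-element query vector is a placeholder for `model.encode("python programming").tolist()`.

```python
from typing import Any


def build_rrf_retriever(query: dict[str, Any], knn: dict[str, Any]) -> dict[str, Any]:
    """Wrap a lexical query and a kNN clause in a single `rrf` retriever.

    Hypothetical helper for illustration; the notebook inlines this dict.
    """
    return {
        "rrf": {
            "retrievers": [
                # Lexical leg: what used to be the top-level `query` argument.
                {"standard": {"query": query}},
                # Vector leg: what used to be the top-level `knn` argument.
                {"knn": knn},
            ]
        }
    }


retriever = build_rrf_retriever(
    query={"match": {"summary": "python programming"}},
    knn={
        "field": "title_vector",
        # Placeholder vector (assumption); the notebook encodes the query text.
        "query_vector": [0.1, 0.2, 0.3],
        "k": 5,
        "num_candidates": 10,
    },
)
```

In the notebook the resulting dict is passed as `client.search(index="book_index", size=5, retriever=retriever)`, with no separate `query`, `knn`, or `rank` arguments.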

Note: This commit covers only the JSON and API changes. Issues with notebook execution or the Makefile may remain unresolved and may need further investigation and testing.
diff --git a/notebooks/search/02-hybrid-search.ipynb b/notebooks/search/02-hybrid-search.ipynb
@@ -196,21 +196,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "def pretty_response(response):\n",
     "    if len(response[\"hits\"][\"hits\"]) == 0:\n",
     "        print(\"Your search returned no results.\")\n",
     "    else:\n",
-    "        for hit in response[\"hits\"][\"hits\"]:\n",
+    "        for idx, hit in enumerate(response[\"hits\"][\"hits\"], start=1):\n",
     "            id = hit[\"_id\"]\n",
     "            publication_date = hit[\"_source\"][\"publish_date\"]\n",
-    "            rank = hit[\"_rank\"]\n",
+    "            score = hit[\"_score\"]\n",
     "            title = hit[\"_source\"][\"title\"]\n",
     "            summary = hit[\"_source\"][\"summary\"]\n",
-    "            pretty_output = f\"\\nID: {id}\\nPublication date: {publication_date}\\nTitle: {title}\\nSummary: {summary}\\nRank: {rank}\"\n",
+    "            pretty_output = f\"\\nID: {id}\\nPublication date: {publication_date}\\nTitle: {title}\\nSummary: {summary}\\nRank: {idx}\\nScore: {score}\"\n",
     "            print(pretty_output)"
    ]
   },
@@ -231,12 +231,12 @@
     "\n",
     "We then use [Reciprocal Rank Fusion (RRF)](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html) to balance the scores to provide a final list of documents, ranked in order of relevance. RRF is a ranking algorithm for combining results from different information retrieval strategies.\n",
     "\n",
-    "Note that _score is null, and we instead use _rank to show our top-ranked documents."
+    "Note: With the retriever API, _score contains the document’s relevance score, and the rank is simply the position in the results (first result is rank 1, etc.)."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "metadata": {},
    "outputs": [
     {
@@ -280,18 +280,22 @@
     "response = client.search(\n",
     "    index=\"book_index\",\n",
     "    size=5,\n",
-    "    query={\"match\": {\"summary\": \"python programming\"}},\n",
-    "    knn={\n",
-    "        \"field\": \"title_vector\",\n",
-    "        \"query_vector\": model.encode(\n",
-    "            \"python programming\"\n",
-    "        ).tolist(),  # generate embedding for query so it can be compared to `title_vector`\n",
-    "        \"k\": 5,\n",
-    "        \"num_candidates\": 10,\n",
+    "    retriever={\n",
+    "        \"rrf\": {\n",
+    "            \"retrievers\": [\n",
+    "                {\"standard\": {\"query\": {\"match\": {\"summary\": \"python programming\"}}}},\n",
+    "                {\n",
+    "                    \"knn\": {\n",
+    "                        \"field\": \"title_vector\",\n",
+    "                        \"query_vector\": model.encode(\"python programming\").tolist(),\n",
+    "                        \"k\": 5,\n",
+    "                        \"num_candidates\": 10,\n",
+    "                    }\n",
+    "                },\n",
+    "            ]\n",
+    "        }\n",
     "    },\n",
-    "    rank={\"rrf\": {}},\n",
     ")\n",
-    "\n",
     "pretty_response(response)"
    ]
   }