diff --git a/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb b/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb index d255fe79aa..37c21f450a 100644 --- a/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb +++ b/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb @@ -1,11 +1,92 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "0a2d56c0", + "metadata": {}, + "source": [ + "\n", + "# Structured Output Evaluation Cookbook\n", + " \n", + "This notebook walks you through a set of focused, runnable examples of how to use the OpenAI **Evals** framework to **test, grade, and iterate on tasks that require large language models to produce structured outputs**.\n", + "\n", + "> **Why does this matter?** \n", + "> Production systems often depend on JSON, SQL, or domain‑specific formats. Relying on spot checks or ad‑hoc prompt tweaks quickly breaks down. Instead, you can *codify* expectations as automated evals and let your team ship on a foundation of bricks instead of sand.\n" + ] + }, + { + "cell_type": "markdown", + "id": "45eee293", + "metadata": {}, + "source": [ + "\n", + "## Quick Tour\n", + "\n", + "* **Section 1 – Prerequisites**: environment variables and package setup \n", + "* **Section 2 – Walk‑through: Code‑symbol extraction**: end‑to‑end demo that grades the model’s ability to extract function and class names from source code. We keep the original logic intact and simply layer documentation around it. \n", + "* **Section 3 – Additional Recipes**: sketches of common production patterns, such as sentiment extraction, as additional code samples for evaluation.\n", + "* **Section 4 – Result Exploration**: lightweight helpers for pulling run output and digging into failures. \n" + ] + }, + { + "cell_type": "markdown", + "id": "e027be46", + "metadata": {}, + "source": [ + "\n", + "## Prerequisites\n", + "\n", + "1. **Install dependencies**:\n", + "\n", + "```bash\n", + "pip install --upgrade openai\n", + "```\n", + "\n", + "2. **Authenticate** by exporting your key:\n", + "\n", + "```bash\n", + "export OPENAI_API_KEY=\"sk-...\"\n", + "```\n", + "\n", + "3. **Optional**: if you plan to run evals in bulk, set up an [organization‑level key](https://platform.openai.com/account/org-settings) with appropriate limits.\n" + ] + }, + { + "cell_type": "markdown", + "id": "4592675d", + "metadata": {}, + "source": [ + "### Use Case 1: Code symbol extraction" + ] + }, + { + "cell_type": "markdown", + "id": "d2a32d53", + "metadata": {}, + "source": [ + "\n", + "The goal is to **extract all function, class, and constant symbols from Python files inside the OpenAI SDK**. \n", + "For each file we ask the model to emit structured JSON like:\n", + "\n", + "```json\n", + "{\n", + " \"symbols\": [\n", + " {\"name\": \"OpenAI\", \"kind\": \"class\"},\n", + " {\"name\": \"Evals\", \"kind\": \"module\"},\n", + " ...\n", + " ]\n", + "}\n", + "```\n", + "\n", + "A rubric model then grades **completeness** (did we capture every symbol?) and **quality** (are the kinds correct?) on a 1‑7 scale.\n" + ] + }, { "cell_type": "markdown", "id": "9dd88e7c", "metadata": {}, "source": [ - "# Evaluating Code Quality Extraction with a Custom Dataset" + "### Evaluating Code Quality Extraction with a Custom Dataset" ] }, { @@ -13,28 +94,65 @@ "id": "64bf0667", "metadata": {}, "source": [ - "This notebook demonstrates how to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset."
+ "Let us walk through an example to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset." + ] + }, + { + "cell_type": "markdown", + "id": "c95faa47", + "metadata": {}, + "source": [ + "### Initialize SDK client\n", + "Creates an `openai.OpenAI` client using the `OPENAI_API_KEY` we exported above. Nothing will run without this." ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 11, "id": "eacc6ac7", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ + "%pip install --upgrade openai pandas rich --quiet\n", + "\n", + "\n", + "\n", "import os\n", "import time\n", "import openai\n", + "from rich import print\n", + "import pandas as pd\n", "\n", "client = openai.OpenAI(\n", " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", ")" ] }, + { + "cell_type": "markdown", + "id": "8200aaf1", + "metadata": {}, + "source": [ + "### Dataset factory & grading rubric\n", + "* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n", + "* `structured_output_grader` defines a detailed evaluation rubric.\n", + "* `client.evals.create(...)` registers the eval with the platform." + ] + }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 4, "id": "b272e193", "metadata": {}, "outputs": [], @@ -110,13 +228,23 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "4e77cbe6", + "metadata": {}, + "source": [ + "### Kick off model runs\n", + "Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint." + ] + }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 5, "id": "18f357e6", "metadata": {}, "outputs": [], "source": [ + "### Kick off model runs\n", "gpt_4one_completions_run = client.evals.runs.create(\n", " name=\"gpt-4.1\",\n", " eval_id=logs_eval.id,\n", @@ -251,13 +379,54 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "dd0aa0c0", + "metadata": {}, + "source": [ + "### Utility poller\n", + "Next, we will use a simple loop that waits for all runs to finish, then saves each run’s JSON to disk so you can inspect it later or attach it to CI artifacts." + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "cbc4f775", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre-wrap;word-wrap:break-word\">
evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "
\n" + ], + "text/plain": [ + "evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "
\n" + ], + "text/plain": [ + "evalrun_68487dcdaba0819182db010fe5331f2e completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ + "### Utility poller\n", "def poll_runs(eval_id, run_ids):\n", " while True:\n", " runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]\n", @@ -278,9 +447,18 @@ "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])" ] }, + { + "cell_type": "markdown", + "id": "77331859", + "metadata": {}, + "source": [ + "### Load outputs for quick inspection\n", + "We will fetch the output items for both runs so we can print or post‑process them." + ] + }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 8, "id": "c316e6eb", "metadata": {}, "outputs": [], @@ -294,26 +472,370 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "1cc61c54", + "metadata": {}, + "source": [ + "### Human-readable dump\n", + "Let us print a side-by-side view of completions vs responses." + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "id": "9f1b502e", "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "

\n", + "Completions vs Responses Output\n", + "

\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Completions OutputResponses Output
{\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvals\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithStreamingResponse\",\"symb...{\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"runs\",\"symbol_type\":\"property\"},{\"name\":\"with_raw_response\",\"symbol_type\":\"property\"},{\"name\":\"with_streaming_response\",\"symbol_type\":\"property\"},{\"name\":\"create\",\"symbol_type\":\"function\"},{...
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "\n", + "# Collect outputs for both runs\n", + "completions_outputs = [item.sample.output[0].content for item in completions_output]\n", + "responses_outputs = [item.sample.output[0].content for item in responses_output]\n", + "\n", + "# Create DataFrame for side-by-side display (truncated to 250 chars for readability)\n", + "df = pd.DataFrame({\n", + " \"Completions Output\": [c[:250].replace('\\n', ' ') + ('...' if len(c) > 250 else '') for c in completions_outputs],\n", + " \"Responses Output\": [r[:250].replace('\\n', ' ') + ('...' if len(r) > 250 else '') for r in responses_outputs]\n", + "})\n", + "\n", + "# Custom color scheme\n", + "custom_styles = [\n", + " {'selector': 'th', 'props': [('font-size', '1.1em'), ('background-color', '#323C50'), ('color', '#FFFFFF'), ('border-bottom', '2px solid #1CA7EC')]},\n", + " {'selector': 'td', 'props': [('font-size', '1em'), ('max-width', '650px'), ('background-color', '#F6F8FA'), ('color', '#222'), ('border-bottom', '1px solid #DDD')]},\n", + " {'selector': 'tr:hover td', 'props': [('background-color', '#D1ECF1'), ('color', '#18647E')]},\n", + " {'selector': 'tbody tr:nth-child(even) td', 'props': [('background-color', '#E8F1FB')]},\n", + " {'selector': 'tbody tr:nth-child(odd) td', 'props': [('background-color', '#F6F8FA')]},\n", + " {'selector': 'table', 'props': [('border-collapse', 'collapse'), ('border-radius', '6px'), ('overflow', 'hidden')]},\n", + "]\n", + "\n", + "styled = (\n", + " df.style\n", + " .set_properties(**{'white-space': 'pre-wrap', 'word-break': 'break-word', 'padding': '8px'})\n", + " .set_table_styles(custom_styles)\n", + " .hide(axis=\"index\")\n", + ")\n", + "\n", + "display(HTML(\"\"\"\n", + "

\n", + "Completions vs Responses Output\n", + "

\n", + "\"\"\"))\n", + "display(styled)" + ] + }, + { + "cell_type": "markdown", + "id": "8cbe934f", + "metadata": {}, + "source": [ + "### Visualize the Results\n", + "\n", + "Below are visualizations that represent the evaluation data and code outputs for structured QA evaluation. These images provide insights into the data distribution and the evaluation workflow.\n", + "\n", + "---\n", + "\n", + "**Evaluation Data Overview**\n", + "\n", + "![Evaluation Data Part 1](../../../images/eval_qa_data_1.png)\n", + "\n", + "![Evaluation Data Part 2](../../../images/eval_qa_data_2.png)\n", + "\n", + "---\n", + "\n", + "**Evaluation Code Workflow**\n", + "\n", + "![Evaluation Code Structure](../../../images/eval_qa_code.png)\n", + "\n", + "---\n", + "\n", + "By reviewing these visualizations, you can better understand the structure of the evaluation dataset and the steps involved in evaluating structured outputs for QA tasks.\n" + ] + }, + { + "cell_type": "markdown", + "id": "a0ae89ef", + "metadata": {}, + "source": [ + "### Use Case 2: Multi-lingual Sentiment Extraction\n", + "In a similar way, let us evaluate a multi-lingual sentiment extraction model with structured outputs." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "e5f0b782", + "metadata": {}, + "outputs": [], + "source": [ + "# Sample in-memory dataset for sentiment extraction\n", + "sentiment_dataset = [\n", + " {\n", + " \"text\": \"I love this product!\",\n", + " \"channel\": \"twitter\",\n", + " \"language\": \"en\"\n", + " },\n", + " {\n", + " \"text\": \"This is the worst experience I've ever had.\",\n", + " \"channel\": \"support_ticket\",\n", + " \"language\": \"en\"\n", + " },\n", + " {\n", + " \"text\": \"It's okay – not great but not bad either.\",\n", + " \"channel\": \"app_review\",\n", + " \"language\": \"en\"\n", + " },\n", + " {\n", + " \"text\": \"No estoy seguro de lo que pienso sobre este producto.\",\n", + " \"channel\": \"facebook\",\n", + " \"language\": \"es\"\n", + " },\n", + " {\n", + " \"text\": \"总体来说,我对这款产品很满意。\",\n", + " \"channel\": \"wechat\",\n", + " \"language\": \"zh\"\n", + " },\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "cb6954f4", + "metadata": {}, + "outputs": [], + "source": [ + "# Define output schema\n", + "sentiment_output_schema = {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"sentiment\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"overall label: positive / negative / neutral\"\n", + " },\n", + " \"confidence\": {\n", + " \"type\": \"number\",\n", + " \"description\": \"confidence score 0-1\"\n", + " },\n", + " \"emotions\": {\n", + " \"type\": \"array\",\n", + " \"description\": \"list of dominant emotions (e.g. 
joy, anger)\",\n", + " \"items\": {\"type\": \"string\"}\n", + " }\n", + " },\n", + " \"required\": [\"sentiment\", \"confidence\", \"emotions\"],\n", + " \"additionalProperties\": False\n", + "}\n", + "\n", + "# Grader prompts\n", + "sentiment_grader_system = \"\"\"You are a strict grader for sentiment extraction.\n", + "Given the text and the model's JSON output, score correctness on a 1-5 scale.\"\"\"\n", + "\n", + "sentiment_grader_user = \"\"\"Text: {{item.text}}\n", + "Model output:\n", + "{{sample.output_json}}\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "ac815aec", + "metadata": {}, + "outputs": [], + "source": [ + "# Register an eval for the richer sentiment task\n", + "sentiment_eval = client.evals.create(\n", + " name=\"sentiment_extraction_eval\",\n", + " data_source_config={\n", + " \"type\": \"custom\",\n", + " \"item_schema\": { # matches the new dataset fields\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"text\": {\"type\": \"string\"},\n", + " \"channel\": {\"type\": \"string\"},\n", + " \"language\": {\"type\": \"string\"},\n", + " },\n", + " \"required\": [\"text\"],\n", + " },\n", + " \"include_sample_schema\": True,\n", + " },\n", + " testing_criteria=[\n", + " {\n", + " \"type\": \"score_model\",\n", + " \"name\": \"Sentiment Grader\",\n", + " \"model\": \"o3\",\n", + " \"input\": [\n", + " {\"role\": \"system\", \"content\": sentiment_grader_system},\n", + " {\"role\": \"user\", \"content\": sentiment_grader_user},\n", + " ],\n", + " \"range\": [1, 5],\n", + " \"pass_threshold\": 3.5,\n", + " }\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f4aa9d6", + "metadata": {}, "outputs": [], "source": [ - "print('# Completions Output')\n", - "for item in completions_output:\n", - " print(item)\n", + "# Run the sentiment eval\n", + "sentiment_run = client.evals.runs.create(\n", + " name=\"gpt-4.1-sentiment\",\n", + " eval_id=sentiment_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": [{\"item\": item} for item in sentiment_dataset],\n", + " },\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"system\",\n", + " \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"},\n", + " },\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"user\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"{{item.text}}\",\n", + " },\n", + " },\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 100,\n", + " \"top_p\": 0.9,\n", + " \"text\": {\n", + " \"format\": {\n", + " \"type\": \"json_schema\",\n", + " \"name\": \"sentiment_output\",\n", + " \"schema\": sentiment_output_schema,\n", + " \"strict\": True,\n", + " },\n", + " },\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "17f5f960", + "metadata": {}, + "source": [ + "### Visualize evals data \n", + "![image](../../../images/evals_sentiment.png)" + ] + }, + { + "cell_type": "markdown", + "id": "ab141018", + "metadata": {}, + "source": [ + "### Summary and Next Steps\n", + "\n", + "In this notebook, we have demonstrated how to use the OpenAI Evaluation API to evaluate a model's performance on a structured output task. 
\n", + "\n", + "**Next steps:**\n", + "- We encourage you to try out the API with your own models and datasets.\n", + "- You can also explore the API documentation for more details on how to use the API. \n", "\n", - "print('\\n# Responses Output')\n", - "for item in responses_output:\n", - " print(item)" + "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).\n" ] } ], "metadata": { "kernelspec": { - "display_name": "openai", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -327,7 +849,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.9" + "version": "3.11.8" } }, "nbformat": 4, diff --git a/examples/evaluation/use-cases/web-search-evaluation.ipynb b/examples/evaluation/use-cases/web-search-evaluation.ipynb index 91f9dbb5f3..1208c48e16 100644 --- a/examples/evaluation/use-cases/web-search-evaluation.ipynb +++ b/examples/evaluation/use-cases/web-search-evaluation.ipynb @@ -11,7 +11,39 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset." + "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset.\n", + "\n", + "**Goals:**\n", + "- Show how to set up and run an evaluation for web search quality.\n", + "- Provide a template for evaluating information retrieval capabilities of LLMs.\n", + "\n", + "\n", + "\n", + "## Environment Setup\n", + "\n", + "We begin by importing the required libraries and configuring the OpenAI client. \n", + "This ensures we have access to the OpenAI API and all necessary utilities for evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "# Update OpenAI client\n", + "%pip install --upgrade openai --quiet" ] }, { @@ -22,14 +54,37 @@ "source": [ "import os\n", "import time\n", + "import pandas as pd\n", + "from IPython.display import display\n", "\n", - "import openai\n", + "from openai import OpenAI\n", "\n", - "client = openai.OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"))\n", + "client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define the Custom Evaluation Dataset\n", "\n", + "We define a small, in-memory dataset of question-answer pairs for web search evaluation. \n", + "Each item contains a `query` (the user's search prompt) and an `answer` (the expected ground truth).\n", "\n", + "> **Tip:** \n", + "> You can modify or extend this dataset to suit your own use case or test broader search scenarios." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ "def get_dataset(limit=None):\n", - " return [\n", + " dataset = [\n", " {\n", " \"query\": \"coolest person in the world, the 100m dash at the 2008 olympics was the best sports event of all time\",\n", " \"answer\": \"usain bolt\",\n", @@ -42,9 +97,59 @@ " \"query\": \"most fun place to visit, I am obsessed with the Philbrook Museum of Art\",\n", " \"answer\": \"tulsa, oklahoma\",\n", " },\n", + " {\n", + " \"query\": \"who created the python programming language, beloved by data scientists everywhere\",\n", + " \"answer\": \"guido van rossum\",\n", + " },\n", + " {\n", + " \"query\": \"greatest chess player in history, famous for the 1972 world championship\",\n", + " \"answer\": \"bobby fischer\",\n", + " },\n", + " {\n", + " \"query\": \"the city of lights, home to the eiffel tower and louvre museum\",\n", + " \"answer\": \"paris\",\n", + " },\n", + " {\n", + " \"query\": \"most popular search engine, whose name is now a verb\",\n", + " \"answer\": \"google\",\n", + " },\n", + " {\n", + " \"query\": \"the first man to walk on the moon, giant leap for mankind\",\n", + " \"answer\": \"neil armstrong\",\n", + " },\n", + " {\n", + " \"query\": \"groundbreaking electric car company founded by elon musk\",\n", + " \"answer\": \"tesla\",\n", + " },\n", + " {\n", + " \"query\": \"founder of microsoft, philanthropist and software pioneer\",\n", + " \"answer\": \"bill gates\",\n", + " },\n", " ]\n", + " return dataset[:limit] if limit else dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Grading Logic\n", + "\n", + "To evaluate the model’s answers, we use an LLM-based pass/fail grader:\n", "\n", + "- **Pass/Fail Grader:** \n", + " An LLM-based grader that checks if the model’s answer (from web search) matches the expected answer (ground truth) or contains the correct information.\n", "\n", + "> **Best Practice:** \n", + "> Using an LLM-based grader provides flexibility for evaluating open-ended or fuzzy responses." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ "pass_fail_grader = \"\"\"\n", "You are a helpful assistant that grades the quality of a web search.\n", "You will be given a query and an answer.\n", @@ -66,10 +171,36 @@ "\n", "{{item.answer}}\n", "\n", - "\"\"\"\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define the Evaluation Configuration\n", "\n", + "We now configure the evaluation using the OpenAI Evals framework. \n", + "\n", + "This step specifies:\n", + "- The evaluation name and dataset.\n", + "- The schema for each item (what fields are present in each Q&A pair).\n", + "- The grader(s) to use (LLM-based pass/fail).\n", + "- The passing criteria and labels.\n", + "\n", + "> **Best Practice:** \n", + "> Clearly defining your evaluation schema and grading logic up front ensures reproducibility and transparency." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# Create the evaluation definition using the OpenAI Evals client.\n", "logs_eval = client.evals.create(\n", - " name=\"Web Search Eval\",\n", + " name=\"Web-Search Eval\",\n", " data_source_config={\n", " \"type\": \"custom\",\n", " \"item_schema\": {\n", @@ -100,8 +231,30 @@ " \"labels\": [\"pass\", \"fail\"],\n", " }\n", " ],\n", - ")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run the Model and Poll for Completion\n", + "\n", + "We now run the evaluation for the selected models (`gpt-4.1` and `gpt-4.1-mini`). \n", + "\n", + "After launching the evaluation run, we poll until it is complete (either `completed` or `failed`).\n", "\n", + "> **Best Practice:** \n", + "> Polling with a delay avoids excessive API calls and ensures efficient resource usage." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# Launch the evaluation run for gpt-4.1 using web search\n", "gpt_4one_responses_run = client.evals.runs.create(\n", " name=\"gpt-4.1\",\n", " eval_id=logs_eval.id,\n", @@ -141,41 +294,272 @@ " \"tools\": [{\"type\": \"web_search_preview\"}],\n", " },\n", " },\n", - ")\n", - "\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# Launch the evaluation run for gpt-4.1-mini using web search\n", + "gpt_4one_mini_responses_run = client.evals.runs.create(\n", + " name=\"gpt-4.1-mini\",\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": [{\"item\": item} for item in get_dataset()],\n", + " },\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"system\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"You are a helpful assistant that searches the web and gives contextually relevant answers.\",\n", + " },\n", + " },\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"user\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"Search the web for the answer to the query {{item.query}}\",\n", + " },\n", + " },\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1-mini\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 10000,\n", + " \"top_p\": 0.9,\n", + " \"tools\": [{\"type\": \"web_search_preview\"}],\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "evalrun_68477e0f56a481919eea5e7d8a04225e completed ResultCounts(errored=0, failed=1, passed=9, total=10)\n", + "evalrun_68477e712bb48191bc7368b084f8c52c completed ResultCounts(errored=0, failed=0, passed=10, total=10)\n" + ] + } + ], + "source": [ + "# poll both runs at the same time, until they are complete or failed\n", "def poll_runs(eval_id, run_ids):\n", - " # poll both runs at the same time, until they are complete or failed\n", " while True:\n", " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", " for run in runs:\n", " print(run.id, run.status, run.result_counts)\n", - " if all(run.status == \"completed\" or run.status == \"failed\" for run in 
runs):\n", + " if all(run.status in {\"completed\", \"failed\"} for run in runs):\n", " break\n", " time.sleep(5)\n", "\n", + "# Start polling the run until completion\n", + "poll_runs(logs_eval.id, [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Display and Interpret Model Outputs\n", "\n", - "poll_runs(logs_eval.id, [gpt_4one_responses_run.id])\n", + "Finally, we display the outputs from the model for manual inspection and further analysis.\n", "\n", + "- Each answer is printed for each query in the dataset.\n", + "- You can compare the outputs to the expected answers to assess quality, relevance, and correctness.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
GPT-4.1 OutputGPT-4.1-mini Output
0If you're captivated by the Philbrook Museum o...Bobby Fischer is widely regarded as one of the...
1\\n## [Paris, France](https://www.google.com/ma...The 2008 Olympic 100m dash is widely regarded ...
2Bill Gates, born on October 28, 1955, in Seatt...If you're looking for fun places to visit in T...
3Usain Bolt's performance in the 100-meter fina...On July 20, 1969, astronaut Neil Armstrong bec...
4It seems you're interested in both the world's...Bill Gates is a renowned software pioneer, phi...
5Neil Armstrong was the first person to walk on...Your statement, \"there is nothing better than ...
6Tesla, Inc. is an American electric vehicle an...The search engine whose name has become synony...
7Bobby Fischer, widely regarded as one of the g...\\n## [Paris, France](https://www.google.com/ma...
8Guido van Rossum, a Dutch programmer born on J...Guido van Rossum, a Dutch programmer born on J...
9The most popular search engine whose name has ...Elon Musk is the CEO and largest shareholder o...
\n", + "
" + ], + "text/plain": [ + " GPT-4.1 Output \\\n", + "0 If you're captivated by the Philbrook Museum o... \n", + "1 \\n## [Paris, France](https://www.google.com/ma... \n", + "2 Bill Gates, born on October 28, 1955, in Seatt... \n", + "3 Usain Bolt's performance in the 100-meter fina... \n", + "4 It seems you're interested in both the world's... \n", + "5 Neil Armstrong was the first person to walk on... \n", + "6 Tesla, Inc. is an American electric vehicle an... \n", + "7 Bobby Fischer, widely regarded as one of the g... \n", + "8 Guido van Rossum, a Dutch programmer born on J... \n", + "9 The most popular search engine whose name has ... \n", + "\n", + " GPT-4.1-mini Output \n", + "0 Bobby Fischer is widely regarded as one of the... \n", + "1 The 2008 Olympic 100m dash is widely regarded ... \n", + "2 If you're looking for fun places to visit in T... \n", + "3 On July 20, 1969, astronaut Neil Armstrong bec... \n", + "4 Bill Gates is a renowned software pioneer, phi... \n", + "5 Your statement, \"there is nothing better than ... \n", + "6 The search engine whose name has become synony... \n", + "7 \\n## [Paris, France](https://www.google.com/ma... \n", + "8 Guido van Rossum, a Dutch programmer born on J... \n", + "9 Elon Musk is the CEO and largest shareholder o... " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Retrieve output items for the 4.1 model after completion\n", "four_one = client.evals.runs.output_items.list(\n", " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", - ")" + ")\n", + "\n", + "# Retrieve output items for the 4.1-mini model after completion\n", + "four_one_mini = client.evals.runs.output_items.list(\n", + " run_id=gpt_4one_mini_responses_run.id, eval_id=logs_eval.id\n", + ")\n", + "\n", + "# Collect outputs for both models\n", + "four_one_outputs = [item.sample.output[0].content for item in four_one]\n", + "four_one_mini_outputs = [item.sample.output[0].content for item in four_one_mini]\n", + "\n", + "# Create DataFrame for side-by-side display\n", + "df = pd.DataFrame({\n", + " \"GPT-4.1 Output\": four_one_outputs,\n", + " \"GPT-4.1-mini Output\": four_one_mini_outputs\n", + "})\n", + "\n", + "display(df)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "for item in four_one:\n", - " print(item.sample.output[0].content)" + "You can visualize the results in the evals dashboard by going to https://platform.openai.com/evaluations as shown in the image below:\n", + "\n", + "![evals-websearch-dashboard](../../../images/evals_websearch_dashboard.png)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we demonstrated a workflow for evaluating the web search capabilities of language models using the OpenAI Evals framework.\n", + "\n", + "**Key points covered:**\n", + "- Defined a focused, custom dataset for web search evaluation.\n", + "- Configured an LLM-based grader for robust assessment.\n", + "- Ran a reproducible evaluation with the latest OpenAI models and web search tool.\n", + "- Retrieved and displayed model outputs for inspection.\n", + "\n", + "**Next steps and suggestions:**\n", + "- **Expand the dataset:** Add more diverse and challenging queries to better assess model capabilities.\n", + "- **Analyze results:** Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses.\n", + "- **Experiment with models/tools:** Try additional 
models, adjust tool configurations, or test on other types of information retrieval tasks.\n", + "- **Automate reporting:** Generate summary tables or plots for easier sharing and decision-making.\n", + "\n", + "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals)." ] } ], "metadata": { "kernelspec": { - "display_name": "openai", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -189,7 +573,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.9" + "version": "3.11.8" } }, "nbformat": 4, diff --git a/images/eval_qa_code.png b/images/eval_qa_code.png new file mode 100644 index 0000000000..623de97022 Binary files /dev/null and b/images/eval_qa_code.png differ diff --git a/images/eval_qa_data_1.png b/images/eval_qa_data_1.png new file mode 100644 index 0000000000..d9a8cec284 Binary files /dev/null and b/images/eval_qa_data_1.png differ diff --git a/images/eval_qa_data_2.png b/images/eval_qa_data_2.png new file mode 100644 index 0000000000..6e2895cb14 Binary files /dev/null and b/images/eval_qa_data_2.png differ diff --git a/images/evals_sentiment.png b/images/evals_sentiment.png new file mode 100644 index 0000000000..5ebe8f3072 Binary files /dev/null and b/images/evals_sentiment.png differ diff --git a/images/evals_websearch_dashboard.png b/images/evals_websearch_dashboard.png new file mode 100644 index 0000000000..ae34fc4c6a Binary files /dev/null and b/images/evals_websearch_dashboard.png differ diff --git a/registry.yaml b/registry.yaml index ac98ad8cc7..552b5fef48 100644 --- a/registry.yaml +++ b/registry.yaml @@ -2147,6 +2147,7 @@ date: 2025-06-09 authors: - josiah-openai + - shikhar-cyber tags: - evals-api - responses @@ -2167,6 +2168,7 @@ date: 2025-06-09 authors: - josiah-openai + - shikhar-cyber tags: - evals-api - responses