diff --git a/examples/evaluation/use-cases/tools-evaluation.ipynb b/examples/evaluation/use-cases/tools-evaluation.ipynb
index cd5c72b52e..5bdf49829c 100644
--- a/examples/evaluation/use-cases/tools-evaluation.ipynb
+++ b/examples/evaluation/use-cases/tools-evaluation.ipynb
@@ -1,268 +1,736 @@
 {
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Evaluating Code Symbol Extraction Quality with a Custom Dataset"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This notebook demonstrates how to evaluate a model's ability to extract symbols from code files using the OpenAI **Evals** framework with a custom in-memory dataset."
-   ]
-  },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6ff95379",
+   "metadata": {},
+   "source": [
+    "# Tool Evaluation with OpenAI Evals\n",
+    "\n",
+    "This cookbook shows how to use tool evaluation to **measure and improve a model's ability to extract structured information from source code**. Here, the target is the set of *symbols* (functions, classes, methods, and variables) defined in Python files."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4cc30394",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Install the latest **openai** Python package (≥ 1.14.0) and set your `OPENAI_API_KEY` environment variable. If you also want to evaluate an *assistant with tools*, enable the *Assistants v2 beta* in your account.\n",
+    "\n",
+    "```bash\n",
+    "pip install --upgrade openai\n",
+    "export OPENAI_API_KEY=sk-...\n",
+    "```\n",
+    "Below we import the SDK and create a client; the next section defines a helper that builds a small dataset from files inside the **openai** package itself."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "acd0d746",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pip install --upgrade openai pandas jinja2 rich --quiet\n",
+    "\n",
+    "import os\n",
+    "import time\n",
+    "import openai\n",
+    "from rich import print\n",
+    "\n",
+    "client = openai.OpenAI(\n",
+    "    api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "80618b60",
+   "metadata": {},
+   "source": [
+    "### Dataset factory & grading rubric\n",
+    "* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n",
+    "* `structured_output_grader` defines a detailed evaluation rubric.\n",
+    "* `sample.output_tools[0].function.arguments.symbols` pulls the symbols the model extracted via its tool call into the grader prompt.\n",
+    "* `client.evals.create(...)` registers the eval with the platform."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "120b6e4d",
+   "metadata": {
+    "tags": [
+     "original"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "def get_dataset(limit=None):\n",
+    "    openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
+    "\n",
+    "    file_paths = [\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
+    "    ]\n",
+    "\n",
+    "    items = []\n",
+    "    for file_path in file_paths:\n",
+    "        with open(file_path, \"r\") as f:\n",
+    "            items.append({\"input\": f.read()})\n",
+    "    if limit:\n",
+    "        return items[:limit]\n",
+    "    return items\n",
+    "\n",
+    "\n",
+    "structured_output_grader = \"\"\"\n",
+    "You are a helpful assistant that grades the quality of extracted information from a code file.\n",
+    "You will be given a code file and a list of extracted information.\n",
+    "You should grade the quality of the extracted information.\n",
+    "\n",
+    "You should grade the quality on a scale of 1 to 7.\n",
+    "You should apply the following criteria, and calculate your score as follows:\n",
+    "You should first check for completeness on a scale of 1 to 7.\n",
+    "Then you should apply a quality modifier.\n",
+    "\n",
+    "The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
+    "If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
+    "If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
+    "etc.\n",
+    "\"\"\"\n",
+    "\n",
+    "structured_output_grader_user_prompt = \"\"\"\n",
+    "<Code File>\n",
+    "{{item.input}}\n",
+    "</Code File>\n",
+    "\n",
+    "<Extracted Information>\n",
+    "{{sample.output_tools[0].function.arguments.symbols}}\n",
+    "</Extracted Information>\n",
+    "\"\"\""
+   ]
+  },
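+  {
+   "cell_type": "markdown",
+   "id": "3c9d1e2f",
+   "metadata": {},
+   "source": [
+    "Before registering the eval, it is worth a quick sanity check of the dataset. The short cell below is an illustrative addition (not required by the eval): it prints how many files were loaded and previews the first one."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5b7a8c9d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = get_dataset()\n",
+    "print(f\"{len(dataset)} files loaded\")\n",
+    "print(dataset[0][\"input\"][:300])  # preview the first file's opening lines"
+   ]
+  },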
\"description\": \"The name of the symbol.\"},\n", + " \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n", + " },\n", + " \"required\": [\"name\", \"symbol_type\"],\n", + " \"additionalProperties\": False,\n", + " },\n", + " }\n", + " },\n", + " \"required\": [\"symbols\"],\n", + " \"additionalProperties\": False,\n", + " },\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "73ae7e5e", + "metadata": {}, + "source": [ + "### Kick off model runs\n", + "Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d650e02", + "metadata": {}, + "outputs": [], + "source": [ + "gpt_4one_completions_run = client.evals.runs.create(\n", + " name=\"gpt-4.1\",\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"completions\",\n", + " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", + " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 10000,\n", + " \"top_p\": 0.9,\n", + " \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n", + " },\n", + " },\n", + ")\n", + "\n", + "gpt_4one_responses_run = client.evals.runs.create(\n", + " name=\"gpt-4.1-mini\",\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", + " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1-mini\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 10000,\n", + " \"top_p\": 0.9,\n", + " \"tools\": [{\"type\": \"function\", **symbol_tool}],\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6ea31f2a", + "metadata": {}, + "source": [ + "### Utility Poller\n", + "\n", + "We create a utility poller that will be used to poll for the results of the eval runs." 
+  {
+   "cell_type": "markdown",
+   "id": "6ea31f2a",
+   "metadata": {},
+   "source": [
+    "### Utility Poller\n",
+    "\n",
+    "We define a small polling helper that checks both runs every few seconds until each one completes or fails."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb8f3df4",
+   "metadata": {},
+   "outputs": [
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "evalrun_68437e5370c481919a6874594ca177d9 queued ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 queued ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 completed ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n"
-     ]
-    }
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">evalrun_6848e2269570819198b757fe12b979da completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "</pre>
\n" ], - "source": [ - "import os\n", - "import time\n", - "\n", - "import openai\n", - "\n", - "client = openai.OpenAI(\n", - " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", - ")\n", - "\n", - "\n", - "def get_dataset(limit=None):\n", - " openai_sdk_file_path = os.path.dirname(openai.__file__)\n", - "\n", - " file_paths = [\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n", - " ]\n", - "\n", - " items = []\n", - " for file_path in file_paths:\n", - " items.append({\"input\": open(file_path, \"r\").read()})\n", - " if limit:\n", - " return items[:limit]\n", - " return items\n", - "\n", - "\n", - "structured_output_grader = \"\"\"\n", - "You are a helpful assistant that grades the quality of extracted information from a code file.\n", - "You will be given a code file and a list of extracted information.\n", - "You should grade the quality of the extracted information.\n", - "\n", - "You should grade the quality on a scale of 1 to 7.\n", - "You should apply the following criteria, and calculate your score as follows:\n", - "You should first check for completeness on a scale of 1 to 7.\n", - "Then you should apply a quality modifier.\n", - "\n", - "The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n", - "If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n", - "If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n", - "etc.\n", - "\"\"\"\n", - "\n", - "structured_output_grader_user_prompt = \"\"\"\n", - "\n", - "{{item.input}}\n", - "\n", - "\n", - "\n", - "{{sample.output_tools[0].function.arguments.symbols}}\n", - "\n", - "\"\"\"\n", - "\n", - "logs_eval = client.evals.create(\n", - " name=\"Code QA Eval\",\n", - " data_source_config={\n", - " \"type\": \"custom\",\n", - " \"item_schema\": {\"type\": \"object\", \"properties\": {\"input\": {\"type\": \"string\"}}},\n", - " \"include_sample_schema\": True,\n", - " },\n", - " testing_criteria=[\n", - " {\n", - " \"type\": \"score_model\",\n", - " \"name\": \"General Evaluator\",\n", - " \"model\": \"o3\",\n", - " \"input\": [\n", - " {\"role\": \"system\", \"content\": structured_output_grader},\n", - " {\"role\": \"user\", \"content\": structured_output_grader_user_prompt},\n", - " ],\n", - " \"range\": [1, 7],\n", - " \"pass_threshold\": 5.5,\n", - " }\n", - " ],\n", - ")\n", - "\n", - "symbol_tool = {\n", - " \"name\": \"extract_symbols\",\n", - " \"description\": \"Extract the symbols from the code file\",\n", - " \"parameters\": {\n", - " \"type\": \"object\",\n", - " \"properties\": {\n", - " \"symbols\": {\n", - " \"type\": \"array\",\n", - " \"description\": \"A list of symbols extracted from Python code.\",\n", - " \"items\": {\n", - " \"type\": \"object\",\n", - " \"properties\": {\n", - " \"name\": {\"type\": \"string\", \"description\": \"The name of the symbol.\"},\n", - " \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n", - " },\n", - " \"required\": [\"name\", \"symbol_type\"],\n", - " \"additionalProperties\": False,\n", - " },\n", - " }\n", - " 
},\n", - " \"required\": [\"symbols\"],\n", - " \"additionalProperties\": False,\n", - " },\n", - "}\n", - "\n", - "gpt_4one_completions_run = client.evals.runs.create(\n", - " name=\"gpt-4.1\",\n", - " eval_id=logs_eval.id,\n", - " data_source={\n", - " \"type\": \"completions\",\n", - " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", - " \"input_messages\": {\n", - " \"type\": \"template\",\n", - " \"template\": [\n", - " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", - " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", - " ],\n", - " },\n", - " \"model\": \"gpt-4.1\",\n", - " \"sampling_params\": {\n", - " \"seed\": 42,\n", - " \"temperature\": 0.7,\n", - " \"max_completions_tokens\": 10000,\n", - " \"top_p\": 0.9,\n", - " \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n", - " },\n", - " },\n", - ")\n", - "\n", - "gpt_4one_responses_run = client.evals.runs.create(\n", - " name=\"gpt-4.1\",\n", - " eval_id=logs_eval.id,\n", - " data_source={\n", - " \"type\": \"responses\",\n", - " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", - " \"input_messages\": {\n", - " \"type\": \"template\",\n", - " \"template\": [\n", - " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", - " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", - " ],\n", - " },\n", - " \"model\": \"gpt-4.1\",\n", - " \"sampling_params\": {\n", - " \"seed\": 42,\n", - " \"temperature\": 0.7,\n", - " \"max_completions_tokens\": 10000,\n", - " \"top_p\": 0.9,\n", - " \"tools\": [{\"type\": \"function\", **symbol_tool}],\n", - " },\n", - " },\n", - ")\n", - "\n", - "\n", - "def poll_runs(eval_id, run_ids):\n", - " # poll both runs at the same time, until they are complete or failed\n", - " while True:\n", - " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", - " for run in runs:\n", - " print(run.id, run.status, run.result_counts)\n", - " if all(run.status in (\"completed\", \"failed\") for run in runs):\n", - " break\n", - " time.sleep(5)\n", - "\n", - "\n", - "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])\n" + "text/plain": [ + "evalrun_6848e2269570819198b757fe12b979da completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" ] + }, + "metadata": {}, + "output_type": "display_data" }, { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "completions_output = client.evals.runs.output_items.list(\n", - " run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n", - ")\n", - "\n", - "responses_output = client.evals.runs.output_items.list(\n", - " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", - ")\n" + "data": { + "text/html": [ + "
<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "</pre>
\n" + ], + "text/plain": [ + "evalrun_6848e227d3a481918a9b970c897b5998 completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" ] - }, + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def poll_runs(eval_id, run_ids):\n", + " # poll both runs at the same time, until they are complete or failed\n", + " while True:\n", + " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", + " for run in runs:\n", + " print(run.id, run.status, run.result_counts)\n", + " if all(run.status in (\"completed\", \"failed\") for run in runs):\n", + " break\n", + " time.sleep(5)\n", + "\n", + "\n", + "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4014cde", + "metadata": { + "tags": [ + "original" + ] + }, + "outputs": [], + "source": [ + "\n", + "### Get Output\n", + "completions_output = client.evals.runs.output_items.list(\n", + " run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n", + ")\n", + "\n", + "responses_output = client.evals.runs.output_items.list(\n", + " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "88ae7e17", + "metadata": {}, + "source": [ + "### Inspecting results\n", + "\n", + "For both completions and responses, we print the *symbols* dictionary that the model returned. You can diff this against the reference answer or compute precision / recall." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c0cddb6d", + "metadata": { + "tags": [ + "original" + ] + }, + "outputs": [ { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[{'name': 'Evals', 'symbol_type': 'class'}, {'name': 'AsyncEvals', 'symbol_type': 'class'}, {'name': 'EvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'EvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': '__all__', 'symbol_type': 'variable'}, {'name': 'Evals.runs', 'symbol_type': 'function'}, {'name': 'Evals.with_raw_response', 'symbol_type': 'function'}, {'name': 'Evals.with_streaming_response', 'symbol_type': 'function'}, {'name': 'Evals.create', 'symbol_type': 'function'}, {'name': 'Evals.retrieve', 'symbol_type': 'function'}, {'name': 'Evals.update', 'symbol_type': 'function'}, {'name': 'Evals.list', 'symbol_type': 'function'}, {'name': 'Evals.delete', 'symbol_type': 'function'}, {'name': 'AsyncEvals.runs', 'symbol_type': 'function'}, {'name': 'AsyncEvals.with_raw_response', 'symbol_type': 'function'}, {'name': 'AsyncEvals.with_streaming_response', 'symbol_type': 'function'}, {'name': 'AsyncEvals.create', 'symbol_type': 'function'}, {'name': 'AsyncEvals.retrieve', 'symbol_type': 'function'}, {'name': 'AsyncEvals.update', 'symbol_type': 'function'}, {'name': 'AsyncEvals.list', 'symbol_type': 'function'}, {'name': 'AsyncEvals.delete', 'symbol_type': 'function'}, {'name': 'EvalsWithRawResponse.__init__', 'symbol_type': 'function'}, {'name': 'EvalsWithRawResponse.runs', 'symbol_type': 'function'}, {'name': 
'AsyncEvalsWithRawResponse.__init__', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithRawResponse.runs', 'symbol_type': 'function'}, {'name': 'EvalsWithStreamingResponse.__init__', 'symbol_type': 'function'}, {'name': 'EvalsWithStreamingResponse.runs', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithStreamingResponse.__init__', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithStreamingResponse.runs', 'symbol_type': 'function'}]\n", - "[{'name': 'Evals', 'symbol_type': 'class'}, {'name': 'AsyncEvals', 'symbol_type': 'class'}, {'name': 'EvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'EvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': '__all__', 'symbol_type': 'variable'}]\n" - ] - } + "data": { + "text/html": [ + "\n", + "
\n", + "

\n", + " Completions vs Responses Output Symbols\n", + "

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Completions OutputResponses Output
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
namesymbol_type
Evalsclass
AsyncEvalsclass
EvalsWithRawResponseclass
AsyncEvalsWithRawResponseclass
EvalsWithStreamingResponseclass
AsyncEvalsWithStreamingResponseclass
__all__variable
\n", + "
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
namesymbol_type
Evalsclass
runsfunction
with_raw_responsefunction
with_streaming_responsefunction
createfunction
retrievefunction
updatefunction
listfunction
deletefunction
AsyncEvalsclass
runsfunction
with_raw_responsefunction
with_streaming_responsefunction
createfunction
retrievefunction
updatefunction
listfunction
deletefunction
EvalsWithRawResponseclass
__init__function
runsfunction
AsyncEvalsWithRawResponseclass
__init__function
runsfunction
EvalsWithStreamingResponseclass
__init__function
runsfunction
AsyncEvalsWithStreamingResponseclass
__init__function
runsfunction
\n", + "
\n", + "
\n" ], - "source": [ - "import json\n", - "\n", - "for item in completions_output:\n", - " print(json.loads(item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"])[\"symbols\"])\n", - "\n", - "for item in responses_output:\n", - " print(json.loads(item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"])[\"symbols\"])\n" + "text/plain": [ + "" ] + }, + "metadata": {}, + "output_type": "display_data" } - ], - "metadata": { - "kernelspec": { - "display_name": "openai", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.9" - } + ], + "source": [ + "import json\n", + "import pandas as pd\n", + "from IPython.display import display, HTML\n", + "\n", + "def extract_symbols(output_list):\n", + " symbols_list = []\n", + " for item in output_list:\n", + " try:\n", + " args = item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"]\n", + " symbols = json.loads(args)[\"symbols\"]\n", + " symbols_list.append(symbols)\n", + " except Exception as e:\n", + " symbols_list.append([{\"error\": str(e)}])\n", + " return symbols_list\n", + "\n", + "completions_symbols = extract_symbols(completions_output)\n", + "responses_symbols = extract_symbols(responses_output)\n", + "\n", + "def symbols_to_html_table(symbols):\n", + " if symbols and isinstance(symbols, list):\n", + " df = pd.DataFrame(symbols)\n", + " return (\n", + " df.style\n", + " .set_properties(**{\n", + " 'white-space': 'pre-wrap',\n", + " 'word-break': 'break-word',\n", + " 'padding': '2px 6px',\n", + " 'border': '1px solid #C3E7FA',\n", + " 'font-size': '0.92em',\n", + " 'background-color': '#FDFEFF'\n", + " })\n", + " .set_table_styles([{\n", + " 'selector': 'th',\n", + " 'props': [\n", + " ('font-size', '0.95em'),\n", + " ('background-color', '#1CA7EC'),\n", + " ('color', '#fff'),\n", + " ('border-bottom', '1px solid #18647E'),\n", + " ('padding', '2px 6px')\n", + " ]\n", + " }])\n", + " .hide(axis='index')\n", + " .to_html()\n", + " )\n", + " return f\"
{str(symbols)}
\"\n", + "\n", + "table_rows = []\n", + "max_len = max(len(completions_symbols), len(responses_symbols))\n", + "for i in range(max_len):\n", + " c_html = symbols_to_html_table(completions_symbols[i]) if i < len(completions_symbols) else \"\"\n", + " r_html = symbols_to_html_table(responses_symbols[i]) if i < len(responses_symbols) else \"\"\n", + " table_rows.append(f\"\"\"\n", + " \n", + " {c_html}\n", + " {r_html}\n", + " \n", + " \"\"\")\n", + "\n", + "table_html = f\"\"\"\n", + "
\n", + "

\n", + " Completions vs Responses Output Symbols\n", + "

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " {''.join(table_rows)}\n", + " \n", + "
Completions OutputResponses Output
\n", + "
\n", + "\"\"\"\n", + "\n", + "display(HTML(table_html))\n" + ] + }, + { + "cell_type": "markdown", + "id": "e8e4ca5a", + "metadata": {}, + "source": [ + "### Visualize Evals Dashboard\n", + "\n", + "You can navigate to the Evals Dashboard in order to visualize the data.\n", + "\n", + "\n", + "![evals_tool_dashboard](../../../images/evals_tool_dashboard.png)\n", + "\n", + "\n", + "You can also take a look at the explanation of the failed results in the Evals Dashboard after the run is complete as shown in the image below.\n", + "\n", + "![evals_tool_failed](../../../images/eval_tools_fail.png)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "50ad84ad", + "metadata": {}, + "source": [ + "This notebook demonstrated how to use OpenAI Evals to assess and improve a model’s ability to extract structured information from Python code using tool calls. \n", + "\n", + "\n", + "OpenAI Evals provides a robust, reproducible framework for evaluating LLMs on structured extraction tasks. By combining clear tool schemas, rigorous grading rubrics, and well-structured datasets, you can measure and improve overall performance.\n", + "\n", + "*For more details, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).*" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 2 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/images/eval_tools_fail.png b/images/eval_tools_fail.png new file mode 100644 index 0000000000..b3e0ba49e7 Binary files /dev/null and b/images/eval_tools_fail.png differ diff --git a/images/evals_tool_dashboard.png b/images/evals_tool_dashboard.png new file mode 100644 index 0000000000..77c9338486 Binary files /dev/null and b/images/evals_tool_dashboard.png differ diff --git a/registry.yaml b/registry.yaml index 552b5fef48..c821bd077c 100644 --- a/registry.yaml +++ b/registry.yaml @@ -2158,6 +2158,7 @@ date: 2025-06-09 authors: - josiah-openai + - shikhar-cyber tags: - evals-api - responses