diff --git a/examples/evaluation/use-cases/tools-evaluation.ipynb b/examples/evaluation/use-cases/tools-evaluation.ipynb
index cd5c72b52e..5bdf49829c 100644
--- a/examples/evaluation/use-cases/tools-evaluation.ipynb
+++ b/examples/evaluation/use-cases/tools-evaluation.ipynb
@@ -1,268 +1,736 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Evaluating Code Symbol Extraction Quality with a Custom Dataset"
- ]
- },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6ff95379",
+ "metadata": {},
+ "source": [
+ "# Tool Evaluation with OpenAI Evals\n",
+ "\n",
+ "This cookbook shows how to **measure and improve a model’s ability to extract structured information from source code** by evaluating its tool (function) calls. The information to extract here is the set of *symbols* (functions, classes, methods, and variables) defined in Python files."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4cc30394",
+ "metadata": {},
+ "source": [
+ "## Setup\n",
+ "\n",
+ "Install a recent **openai** Python package (new enough to include the Evals API) together with the other dependencies used below, and set your `OPENAI_API_KEY` environment variable.\n",
+ "\n",
+ "```bash\n",
+ "pip install --upgrade openai pandas jinja2 rich\n",
+ "export OPENAI_API_KEY=sk-...\n",
+ "```\n",
+ "The next cell installs the dependencies from inside the notebook, imports the SDK, and creates a client; a later cell defines a helper that builds a small dataset from files inside the **openai** package itself."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "acd0d746",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This notebook demonstrates how to evaluate a model's ability to extract symbols from code files using the OpenAI **Evals** framework with a custom in-memory dataset."
- ]
- },
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install --upgrade openai pandas jinja2 rich --quiet\n",
+ "\n",
+ "import os\n",
+ "import time\n",
+ "import openai\n",
+ "from rich import print\n",
+ "\n",
+ "client = openai.OpenAI(\n",
+ " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80618b60",
+ "metadata": {},
+ "source": [
+ "### Dataset factory & grading rubric\n",
+ "* `get_dataset` builds a small in-memory dataset by reading several files from the installed SDK.\n",
+ "* `structured_output_grader` defines the evaluation rubric that the grading model will follow.\n",
+ "* `{{sample.output_tools[0].function.arguments.symbols}}` in the grader's user prompt references the symbols the model extracted via its tool call.\n",
+ "* `client.evals.create(...)` (in the next section) registers the eval with the platform."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "120b6e4d",
+ "metadata": {
+ "tags": [
+ "original"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def get_dataset(limit=None):\n",
+ " openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
+ "\n",
+ " file_paths = [\n",
+ " os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
+ " os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
+ " os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
+ " os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
+ " os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
+ " ]\n",
+ "\n",
+ " items = []\n",
+ " for file_path in file_paths:\n",
+ " items.append({\"input\": open(file_path, \"r\").read()})\n",
+ " if limit:\n",
+ " return items[:limit]\n",
+ " return items\n",
+ "\n",
+ "\n",
+ "structured_output_grader = \"\"\"\n",
+ "You are a helpful assistant that grades the quality of extracted information from a code file.\n",
+ "You will be given a code file and a list of extracted information.\n",
+ "You should grade the quality of the extracted information.\n",
+ "\n",
+ "You should grade the quality on a scale of 1 to 7.\n",
+ "You should apply the following criteria, and calculate your score as follows:\n",
+ "You should first check for completeness on a scale of 1 to 7.\n",
+ "Then you should apply a quality modifier.\n",
+ "\n",
+ "The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
+ "If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
+ "If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
+ "etc.\n",
+ "\"\"\"\n",
+ "\n",
+ "structured_output_grader_user_prompt = \"\"\"\n",
+ "<Code File>\n",
+ "{{item.input}}\n",
+ "</Code File>\n",
+ "\n",
+ "<Extracted Information>\n",
+ "{{sample.output_tools[0].function.arguments.symbols}}\n",
+ "</Extracted Information>\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d7f66a56",
+ "metadata": {},
+ "source": [
+ "### Evals Creation\n",
+ "\n",
+ "Here we register the eval itself: a **custom** data source whose items each carry a single `input` string, plus a `score_model` grader (using `o3`) that scores every sample on the 1–7 rubric with a pass threshold of 5.0. The cell below also defines the `extract_symbols` function tool that the model will be asked to call.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "95a5eaf6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "logs_eval = client.evals.create(\n",
+ " name=\"Code QA Eval\",\n",
+ " data_source_config={\n",
+ " \"type\": \"custom\",\n",
+ " \"item_schema\": {\"type\": \"object\", \"properties\": {\"input\": {\"type\": \"string\"}}},\n",
+ " \"include_sample_schema\": True,\n",
+ " },\n",
+ " testing_criteria=[\n",
+ " {\n",
+ " \"type\": \"score_model\",\n",
+ " \"name\": \"General Evaluator\",\n",
+ " \"model\": \"o3\",\n",
+ " \"input\": [\n",
+ " {\"role\": \"system\", \"content\": structured_output_grader},\n",
+ " {\"role\": \"user\", \"content\": structured_output_grader_user_prompt},\n",
+ " ],\n",
+ " \"range\": [1, 7],\n",
+ " \"pass_threshold\": 5.0,\n",
+ " }\n",
+ " ],\n",
+ ")\n",
+ "\n",
+ "symbol_tool = {\n",
+ " \"name\": \"extract_symbols\",\n",
+ " \"description\": \"Extract the symbols from the code file\",\n",
+ " \"parameters\": {\n",
+ " \"type\": \"object\",\n",
+ " \"properties\": {\n",
+ " \"symbols\": {\n",
+ " \"type\": \"array\",\n",
+ " \"description\": \"A list of symbols extracted from Python code.\",\n",
+ " \"items\": {\n",
+ " \"type\": \"object\",\n",
+ " \"properties\": {\n",
+ " \"name\": {\"type\": \"string\", \"description\": \"The name of the symbol.\"},\n",
+ " \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n",
+ " },\n",
+ " \"required\": [\"name\", \"symbol_type\"],\n",
+ " \"additionalProperties\": False,\n",
+ " },\n",
+ " }\n",
+ " },\n",
+ " \"required\": [\"symbols\"],\n",
+ " \"additionalProperties\": False,\n",
+ " },\n",
+ "}"
+ ]
+ },
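+ {
+ "cell_type": "markdown",
+ "id": "c9a1e2d4",
+ "metadata": {},
+ "source": [
+ "To make the grader template concrete: when the model calls `extract_symbols`, its tool-call arguments are a JSON payload matching the schema above, and `{{sample.output_tools[0].function.arguments.symbols}}` resolves to the `symbols` list inside it. The values below are purely illustrative.\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical tool-call arguments (illustrative values, not real model output).\n",
+ "example_tool_arguments = {\n",
+ " \"symbols\": [\n",
+ " {\"name\": \"Evals\", \"symbol_type\": \"class\"},\n",
+ " {\"name\": \"Evals.create\", \"symbol_type\": \"function\"},\n",
+ " {\"name\": \"__all__\", \"symbol_type\": \"variable\"},\n",
+ " ]\n",
+ "}\n",
+ "```"
+ ]
+ },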
+ {
+ "cell_type": "markdown",
+ "id": "73ae7e5e",
+ "metadata": {},
+ "source": [
+ "### Kick off model runs\n",
+ "Here we launch two runs against the same eval: one that calls the **Completions** endpoint with `gpt-4.1`, and one that calls the **Responses** endpoint with `gpt-4.1-mini`.\n",
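+ "\n",
+ "The two endpoints wrap the same function tool slightly differently; the snippet below is an illustrative sketch (not executed here) of how `symbol_tool` is passed to each.\n",
+ "\n",
+ "```python\n",
+ "# Illustrative only: how the same symbol_tool dict is wrapped for each endpoint.\n",
+ "completions_tools = [{\"type\": \"function\", \"function\": symbol_tool}]  # Chat Completions nests it under \"function\"\n",
+ "responses_tools = [{\"type\": \"function\", **symbol_tool}]  # Responses flattens name/description/parameters\n",
+ "```"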
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0d650e02",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gpt_4one_completions_run = client.evals.runs.create(\n",
+ " name=\"gpt-4.1\",\n",
+ " eval_id=logs_eval.id,\n",
+ " data_source={\n",
+ " \"type\": \"completions\",\n",
+ " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
+ " \"input_messages\": {\n",
+ " \"type\": \"template\",\n",
+ " \"template\": [\n",
+ " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
+ " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
+ " ],\n",
+ " },\n",
+ " \"model\": \"gpt-4.1\",\n",
+ " \"sampling_params\": {\n",
+ " \"seed\": 42,\n",
+ " \"temperature\": 0.7,\n",
+ " \"max_completions_tokens\": 10000,\n",
+ " \"top_p\": 0.9,\n",
+ " \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n",
+ " },\n",
+ " },\n",
+ ")\n",
+ "\n",
+ "gpt_4one_responses_run = client.evals.runs.create(\n",
+ " name=\"gpt-4.1-mini\",\n",
+ " eval_id=logs_eval.id,\n",
+ " data_source={\n",
+ " \"type\": \"responses\",\n",
+ " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
+ " \"input_messages\": {\n",
+ " \"type\": \"template\",\n",
+ " \"template\": [\n",
+ " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
+ " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
+ " ],\n",
+ " },\n",
+ " \"model\": \"gpt-4.1-mini\",\n",
+ " \"sampling_params\": {\n",
+ " \"seed\": 42,\n",
+ " \"temperature\": 0.7,\n",
+ " \"max_completions_tokens\": 10000,\n",
+ " \"top_p\": 0.9,\n",
+ " \"tools\": [{\"type\": \"function\", **symbol_tool}],\n",
+ " },\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6ea31f2a",
+ "metadata": {},
+ "source": [
+ "### Utility Poller\n",
+ "\n",
+ "A small helper polls both runs every five seconds and prints each run's status until both have either completed or failed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fb8f3df4",
+ "metadata": {},
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "evalrun_68437e5370c481919a6874594ca177d9 queued ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e544fe881918f76dbd8dce3fd15 queued ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e544fe881918f76dbd8dce3fd15 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e544fe881918f76dbd8dce3fd15 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n",
- "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
- "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n",
- "evalrun_68437e5370c481919a6874594ca177d9 completed ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
- "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n"
- ]
- }
+ "data": {
+ "text/html": [
+ "<pre>evalrun_6848e2269570819198b757fe12b979da completed\n",
+ "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+ "</pre>\n"
],
- "source": [
- "import os\n",
- "import time\n",
- "\n",
- "import openai\n",
- "\n",
- "client = openai.OpenAI(\n",
- " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
- ")\n",
- "\n",
- "\n",
- "def get_dataset(limit=None):\n",
- " openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
- "\n",
- " file_paths = [\n",
- " os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
- " os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
- " os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
- " os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
- " os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
- " ]\n",
- "\n",
- " items = []\n",
- " for file_path in file_paths:\n",
- " items.append({\"input\": open(file_path, \"r\").read()})\n",
- " if limit:\n",
- " return items[:limit]\n",
- " return items\n",
- "\n",
- "\n",
- "structured_output_grader = \"\"\"\n",
- "You are a helpful assistant that grades the quality of extracted information from a code file.\n",
- "You will be given a code file and a list of extracted information.\n",
- "You should grade the quality of the extracted information.\n",
- "\n",
- "You should grade the quality on a scale of 1 to 7.\n",
- "You should apply the following criteria, and calculate your score as follows:\n",
- "You should first check for completeness on a scale of 1 to 7.\n",
- "Then you should apply a quality modifier.\n",
- "\n",
- "The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
- "If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
- "If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
- "etc.\n",
- "\"\"\"\n",
- "\n",
- "structured_output_grader_user_prompt = \"\"\"\n",
- "<Code File>\n",
- "{{item.input}}\n",
- "</Code File>\n",
- "\n",
- "<Extracted Information>\n",
- "{{sample.output_tools[0].function.arguments.symbols}}\n",
- "</Extracted Information>\n",
- "\"\"\"\n",
- "\n",
- "logs_eval = client.evals.create(\n",
- " name=\"Code QA Eval\",\n",
- " data_source_config={\n",
- " \"type\": \"custom\",\n",
- " \"item_schema\": {\"type\": \"object\", \"properties\": {\"input\": {\"type\": \"string\"}}},\n",
- " \"include_sample_schema\": True,\n",
- " },\n",
- " testing_criteria=[\n",
- " {\n",
- " \"type\": \"score_model\",\n",
- " \"name\": \"General Evaluator\",\n",
- " \"model\": \"o3\",\n",
- " \"input\": [\n",
- " {\"role\": \"system\", \"content\": structured_output_grader},\n",
- " {\"role\": \"user\", \"content\": structured_output_grader_user_prompt},\n",
- " ],\n",
- " \"range\": [1, 7],\n",
- " \"pass_threshold\": 5.5,\n",
- " }\n",
- " ],\n",
- ")\n",
- "\n",
- "symbol_tool = {\n",
- " \"name\": \"extract_symbols\",\n",
- " \"description\": \"Extract the symbols from the code file\",\n",
- " \"parameters\": {\n",
- " \"type\": \"object\",\n",
- " \"properties\": {\n",
- " \"symbols\": {\n",
- " \"type\": \"array\",\n",
- " \"description\": \"A list of symbols extracted from Python code.\",\n",
- " \"items\": {\n",
- " \"type\": \"object\",\n",
- " \"properties\": {\n",
- " \"name\": {\"type\": \"string\", \"description\": \"The name of the symbol.\"},\n",
- " \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n",
- " },\n",
- " \"required\": [\"name\", \"symbol_type\"],\n",
- " \"additionalProperties\": False,\n",
- " },\n",
- " }\n",
- " },\n",
- " \"required\": [\"symbols\"],\n",
- " \"additionalProperties\": False,\n",
- " },\n",
- "}\n",
- "\n",
- "gpt_4one_completions_run = client.evals.runs.create(\n",
- " name=\"gpt-4.1\",\n",
- " eval_id=logs_eval.id,\n",
- " data_source={\n",
- " \"type\": \"completions\",\n",
- " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
- " \"input_messages\": {\n",
- " \"type\": \"template\",\n",
- " \"template\": [\n",
- " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
- " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
- " ],\n",
- " },\n",
- " \"model\": \"gpt-4.1\",\n",
- " \"sampling_params\": {\n",
- " \"seed\": 42,\n",
- " \"temperature\": 0.7,\n",
- " \"max_completions_tokens\": 10000,\n",
- " \"top_p\": 0.9,\n",
- " \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n",
- " },\n",
- " },\n",
- ")\n",
- "\n",
- "gpt_4one_responses_run = client.evals.runs.create(\n",
- " name=\"gpt-4.1\",\n",
- " eval_id=logs_eval.id,\n",
- " data_source={\n",
- " \"type\": \"responses\",\n",
- " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
- " \"input_messages\": {\n",
- " \"type\": \"template\",\n",
- " \"template\": [\n",
- " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
- " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
- " ],\n",
- " },\n",
- " \"model\": \"gpt-4.1\",\n",
- " \"sampling_params\": {\n",
- " \"seed\": 42,\n",
- " \"temperature\": 0.7,\n",
- " \"max_completions_tokens\": 10000,\n",
- " \"top_p\": 0.9,\n",
- " \"tools\": [{\"type\": \"function\", **symbol_tool}],\n",
- " },\n",
- " },\n",
- ")\n",
- "\n",
- "\n",
- "def poll_runs(eval_id, run_ids):\n",
- " # poll both runs at the same time, until they are complete or failed\n",
- " while True:\n",
- " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n",
- " for run in runs:\n",
- " print(run.id, run.status, run.result_counts)\n",
- " if all(run.status in (\"completed\", \"failed\") for run in runs):\n",
- " break\n",
- " time.sleep(5)\n",
- "\n",
- "\n",
- "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])\n"
+ "text/plain": [
+ "evalrun_6848e2269570819198b757fe12b979da completed\n",
+ "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "completions_output = client.evals.runs.output_items.list(\n",
- " run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n",
- ")\n",
- "\n",
- "responses_output = client.evals.runs.output_items.list(\n",
- " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
- ")\n"
+ "data": {
+ "text/html": [
+ "<pre>evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
+ "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+ "</pre>\n"
+ ],
+ "text/plain": [
+ "evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
+ "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
- },
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def poll_runs(eval_id, run_ids):\n",
+ " # poll both runs at the same time, until they are complete or failed\n",
+ " while True:\n",
+ " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n",
+ " for run in runs:\n",
+ " print(run.id, run.status, run.result_counts)\n",
+ " if all(run.status in (\"completed\", \"failed\") for run in runs):\n",
+ " break\n",
+ " time.sleep(5)\n",
+ "\n",
+ "\n",
+ "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f4014cde",
+ "metadata": {
+ "tags": [
+ "original"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Retrieve the output items for both runs\n",
+ "completions_output = client.evals.runs.output_items.list(\n",
+ " run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n",
+ ")\n",
+ "\n",
+ "responses_output = client.evals.runs.output_items.list(\n",
+ " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "88ae7e17",
+ "metadata": {},
+ "source": [
+ "### Inspecting results\n",
+ "\n",
+ "For both runs we pull the extracted *symbols* out of each tool call and render the two result sets side by side as an HTML table. You could also diff them against a reference answer or compute precision / recall; a minimal agreement sketch follows the table below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c0cddb6d",
+ "metadata": {
+ "tags": [
+ "original"
+ ]
+ },
+ "outputs": [
{
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[{'name': 'Evals', 'symbol_type': 'class'}, {'name': 'AsyncEvals', 'symbol_type': 'class'}, {'name': 'EvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'EvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': '__all__', 'symbol_type': 'variable'}, {'name': 'Evals.runs', 'symbol_type': 'function'}, {'name': 'Evals.with_raw_response', 'symbol_type': 'function'}, {'name': 'Evals.with_streaming_response', 'symbol_type': 'function'}, {'name': 'Evals.create', 'symbol_type': 'function'}, {'name': 'Evals.retrieve', 'symbol_type': 'function'}, {'name': 'Evals.update', 'symbol_type': 'function'}, {'name': 'Evals.list', 'symbol_type': 'function'}, {'name': 'Evals.delete', 'symbol_type': 'function'}, {'name': 'AsyncEvals.runs', 'symbol_type': 'function'}, {'name': 'AsyncEvals.with_raw_response', 'symbol_type': 'function'}, {'name': 'AsyncEvals.with_streaming_response', 'symbol_type': 'function'}, {'name': 'AsyncEvals.create', 'symbol_type': 'function'}, {'name': 'AsyncEvals.retrieve', 'symbol_type': 'function'}, {'name': 'AsyncEvals.update', 'symbol_type': 'function'}, {'name': 'AsyncEvals.list', 'symbol_type': 'function'}, {'name': 'AsyncEvals.delete', 'symbol_type': 'function'}, {'name': 'EvalsWithRawResponse.__init__', 'symbol_type': 'function'}, {'name': 'EvalsWithRawResponse.runs', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithRawResponse.__init__', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithRawResponse.runs', 'symbol_type': 'function'}, {'name': 'EvalsWithStreamingResponse.__init__', 'symbol_type': 'function'}, {'name': 'EvalsWithStreamingResponse.runs', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithStreamingResponse.__init__', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithStreamingResponse.runs', 'symbol_type': 'function'}]\n",
- "[{'name': 'Evals', 'symbol_type': 'class'}, {'name': 'AsyncEvals', 'symbol_type': 'class'}, {'name': 'EvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'EvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': '__all__', 'symbol_type': 'variable'}]\n"
- ]
- }
+ "data": {
+ "text/html": [
+ "<div><b>Completions vs Responses Output Symbols</b>: side-by-side tables of the extracted symbol name / symbol_type pairs (full rendered HTML output omitted)</div>\n"
+ ],
- "source": [
- "import json\n",
- "\n",
- "for item in completions_output:\n",
- " print(json.loads(item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"])[\"symbols\"])\n",
- "\n",
- "for item in responses_output:\n",
- " print(json.loads(item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"])[\"symbols\"])\n"
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
}
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "openai",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.12.9"
- }
+ ],
+ "source": [
+ "import json\n",
+ "import pandas as pd\n",
+ "from IPython.display import display, HTML\n",
+ "\n",
+ "def extract_symbols(output_list):\n",
+ " symbols_list = []\n",
+ " for item in output_list:\n",
+ " try:\n",
+ " args = item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"]\n",
+ " symbols = json.loads(args)[\"symbols\"]\n",
+ " symbols_list.append(symbols)\n",
+ " except Exception as e:\n",
+ " symbols_list.append([{\"error\": str(e)}])\n",
+ " return symbols_list\n",
+ "\n",
+ "completions_symbols = extract_symbols(completions_output)\n",
+ "responses_symbols = extract_symbols(responses_output)\n",
+ "\n",
+ "def symbols_to_html_table(symbols):\n",
+ " if symbols and isinstance(symbols, list):\n",
+ " df = pd.DataFrame(symbols)\n",
+ " return (\n",
+ " df.style\n",
+ " .set_properties(**{\n",
+ " 'white-space': 'pre-wrap',\n",
+ " 'word-break': 'break-word',\n",
+ " 'padding': '2px 6px',\n",
+ " 'border': '1px solid #C3E7FA',\n",
+ " 'font-size': '0.92em',\n",
+ " 'background-color': '#FDFEFF'\n",
+ " })\n",
+ " .set_table_styles([{\n",
+ " 'selector': 'th',\n",
+ " 'props': [\n",
+ " ('font-size', '0.95em'),\n",
+ " ('background-color', '#1CA7EC'),\n",
+ " ('color', '#fff'),\n",
+ " ('border-bottom', '1px solid #18647E'),\n",
+ " ('padding', '2px 6px')\n",
+ " ]\n",
+ " }])\n",
+ " .hide(axis='index')\n",
+ " .to_html()\n",
+ " )\n",
+ " return f\"<pre>{str(symbols)}</pre>\"\n",
+ "\n",
+ "table_rows = []\n",
+ "max_len = max(len(completions_symbols), len(responses_symbols))\n",
+ "for i in range(max_len):\n",
+ " c_html = symbols_to_html_table(completions_symbols[i]) if i < len(completions_symbols) else \"\"\n",
+ " r_html = symbols_to_html_table(responses_symbols[i]) if i < len(responses_symbols) else \"\"\n",
+ " table_rows.append(f\"\"\"\n",
+ " <tr>\n",
+ " <td>{c_html}</td>\n",
+ " <td>{r_html}</td>\n",
+ " </tr>\n",
+ " \"\"\")\n",
+ "\n",
+ "table_html = f\"\"\"\n",
+ "<div>\n",
+ " <h3>Completions vs Responses Output Symbols</h3>\n",
+ " <table>\n",
+ " <thead>\n",
+ " <tr>\n",
+ " <th>Completions Output</th>\n",
+ " <th>Responses Output</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " {''.join(table_rows)}\n",
+ " </tbody>\n",
+ " </table>\n",
+ "</div>\n",
+ "\"\"\"\n",
+ "\n",
+ "display(HTML(table_html))\n"
+ ]
+ },
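+ {
+ "cell_type": "markdown",
+ "id": "f7b3d2c1",
+ "metadata": {},
+ "source": [
+ "For a quick numeric comparison alongside the table, the sketch below (assuming the `completions_symbols` and `responses_symbols` lists built above) treats one run as the reference and the other as the prediction, then computes simple precision / recall over `(name, symbol_type)` pairs. With a hand-labelled gold list you would use that as the reference instead.\n",
+ "\n",
+ "```python\n",
+ "# Minimal sketch: agreement between the two runs' extractions (no gold labels here).\n",
+ "def to_pairs(symbols):\n",
+ " return {(s.get(\"name\"), s.get(\"symbol_type\")) for s in symbols if \"name\" in s}\n",
+ "\n",
+ "reference = to_pairs(completions_symbols[0])\n",
+ "prediction = to_pairs(responses_symbols[0])\n",
+ "overlap = reference & prediction\n",
+ "precision = len(overlap) / len(prediction) if prediction else 0.0\n",
+ "recall = len(overlap) / len(reference) if reference else 0.0\n",
+ "print(f\"precision={precision:.2f} recall={recall:.2f}\")\n",
+ "```"
+ ]
+ },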
+ {
+ "cell_type": "markdown",
+ "id": "e8e4ca5a",
+ "metadata": {},
+ "source": [
+ "### Visualize Evals Dashboard\n",
+ "\n",
+ "You can open the eval in the Evals Dashboard to explore the runs and per-item results visually.\n",
+ "\n",
+ "<img src=\"../../../images/evals_tool_dashboard.png\" alt=\"Evals dashboard showing the completions and responses runs\" width=\"800\">\n",
+ "\n",
+ "Once a run is complete, the dashboard also shows the grader's explanation for any failed results, as in the image below.\n",
+ "\n",
+ "<img src=\"../../../images/eval_tools_fail.png\" alt=\"Grader explanation for a failed result\" width=\"800\">\n"
+ ]
+ },
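+ {
+ "cell_type": "markdown",
+ "id": "b5c6d7e8",
+ "metadata": {},
+ "source": [
+ "Each run object also carries a `report_url` field (assuming a recent SDK version; check your installed version if the attribute is missing) that links straight to the run's page in the dashboard:\n",
+ "\n",
+ "```python\n",
+ "# Assumes the eval run objects expose report_url; prints direct dashboard links.\n",
+ "print(gpt_4one_completions_run.report_url)\n",
+ "print(gpt_4one_responses_run.report_url)\n",
+ "```"
+ ]
+ },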
+ {
+ "cell_type": "markdown",
+ "id": "50ad84ad",
+ "metadata": {},
+ "source": [
+ "### Conclusion\n",
+ "\n",
+ "This notebook demonstrated how to use OpenAI Evals to assess and improve a model’s ability to extract structured information from Python code using tool calls.\n",
+ "\n",
+ "By combining a clear tool schema, a rigorous model-based grading rubric, and a small custom dataset, the Evals API gives you a reproducible way to measure this kind of structured-extraction task and to compare models or endpoints against each other.\n",
+ "\n",
+ "*For more details, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).*"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
},
- "nbformat": 4,
- "nbformat_minor": 2
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
}
diff --git a/images/eval_tools_fail.png b/images/eval_tools_fail.png
new file mode 100644
index 0000000000..b3e0ba49e7
Binary files /dev/null and b/images/eval_tools_fail.png differ
diff --git a/images/evals_tool_dashboard.png b/images/evals_tool_dashboard.png
new file mode 100644
index 0000000000..77c9338486
Binary files /dev/null and b/images/evals_tool_dashboard.png differ
diff --git a/registry.yaml b/registry.yaml
index 552b5fef48..c821bd077c 100644
--- a/registry.yaml
+++ b/registry.yaml
@@ -2158,6 +2158,7 @@
date: 2025-06-09
authors:
- josiah-openai
+ - shikhar-cyber
tags:
- evals-api
- responses