diff --git a/examples/evaluation/use-cases/tools-evaluation.ipynb b/examples/evaluation/use-cases/tools-evaluation.ipynb
index cd5c72b52e..5bdf49829c 100644
--- a/examples/evaluation/use-cases/tools-evaluation.ipynb
+++ b/examples/evaluation/use-cases/tools-evaluation.ipynb
@@ -1,268 +1,736 @@
 {
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Evaluating Code Symbol Extraction Quality with a Custom Dataset"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This notebook demonstrates how to evaluate a model's ability to extract symbols from code files using the OpenAI **Evals** framework with a custom in-memory dataset."
-   ]
-  },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6ff95379",
+   "metadata": {},
+   "source": [
+    "# Tool Evaluation with OpenAI Evals\n",
+    "\n",
+    "This cookbook shows how to use tool evaluation to **measure and improve a model's ability to extract structured information from source code**. Here, the target is the set of *symbols* (functions, classes, methods, and variables) defined in Python files."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4cc30394",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Install the latest **openai** Python package (≥ 1.14.0) and set your `OPENAI_API_KEY` environment variable. If you also want to evaluate an *assistant with tools*, enable the *Assistants v2 beta* in your account.\n",
+    "\n",
+    "```bash\n",
+    "pip install --upgrade openai\n",
+    "export OPENAI_API_KEY=sk-...\n",
+    "```\n",
+    "Below we import the SDK and create a client; the next section defines a helper that builds a small dataset from files inside the **openai** package itself."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "acd0d746",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pip install --upgrade openai pandas jinja2 rich --quiet\n",
+    "\n",
+    "import os\n",
+    "import time\n",
+    "import openai\n",
+    "from rich import print\n",
+    "\n",
+    "client = openai.OpenAI(\n",
+    "    api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "80618b60",
+   "metadata": {},
+   "source": [
+    "### Dataset factory & grading rubric\n",
+    "* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n",
+    "* `structured_output_grader` defines a detailed evaluation rubric.\n",
+    "* `sample.output_tools[0].function.arguments.symbols` pulls the symbols the model extracted via its tool call into the grader prompt.\n",
+    "* `client.evals.create(...)` registers the eval with the platform."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "120b6e4d",
+   "metadata": {
+    "tags": [
+     "original"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "def get_dataset(limit=None):\n",
+    "    openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
+    "\n",
+    "    file_paths = [\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
+    "        os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
+    "    ]\n",
+    "\n",
+    "    items = []\n",
+    "    for file_path in file_paths:\n",
+    "        with open(file_path, \"r\") as f:\n",
+    "            items.append({\"input\": f.read()})\n",
+    "    if limit:\n",
+    "        return items[:limit]\n",
+    "    return items\n",
+    "\n",
+    "\n",
+    "structured_output_grader = \"\"\"\n",
+    "You are a helpful assistant that grades the quality of extracted information from a code file.\n",
+    "You will be given a code file and a list of extracted information.\n",
+    "You should grade the quality of the extracted information.\n",
+    "\n",
+    "You should grade the quality on a scale of 1 to 7.\n",
+    "You should apply the following criteria, and calculate your score as follows:\n",
+    "You should first check for completeness on a scale of 1 to 7.\n",
+    "Then you should apply a quality modifier.\n",
+    "\n",
+    "The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
+    "If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
+    "If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
+    "etc.\n",
+    "\"\"\"\n",
+    "\n",
+    "structured_output_grader_user_prompt = \"\"\"\n",
+    "<Code File>\n",
+    "{{item.input}}\n",
+    "</Code File>\n",
+    "\n",
+    "<Extracted Information>\n",
+    "{{sample.output_tools[0].function.arguments.symbols}}\n",
+    "</Extracted Information>\n",
+    "\"\"\""
+   ]
+  },
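+  {
+   "cell_type": "markdown",
+   "id": "3c9d1e2f",
+   "metadata": {},
+   "source": [
+    "Before registering the eval, it is worth a quick sanity check of the dataset. The short cell below is an illustrative addition (not required by the eval): it prints how many files were loaded and previews the first one."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5b7a8c9d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = get_dataset()\n",
+    "print(f\"{len(dataset)} files loaded\")\n",
+    "print(dataset[0][\"input\"][:300])  # preview the first file's opening lines"
+   ]
+  },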
\"description\": \"The name of the symbol.\"},\n", + " \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n", + " },\n", + " \"required\": [\"name\", \"symbol_type\"],\n", + " \"additionalProperties\": False,\n", + " },\n", + " }\n", + " },\n", + " \"required\": [\"symbols\"],\n", + " \"additionalProperties\": False,\n", + " },\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "73ae7e5e", + "metadata": {}, + "source": [ + "### Kick off model runs\n", + "Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d650e02", + "metadata": {}, + "outputs": [], + "source": [ + "gpt_4one_completions_run = client.evals.runs.create(\n", + " name=\"gpt-4.1\",\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"completions\",\n", + " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", + " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 10000,\n", + " \"top_p\": 0.9,\n", + " \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n", + " },\n", + " },\n", + ")\n", + "\n", + "gpt_4one_responses_run = client.evals.runs.create(\n", + " name=\"gpt-4.1-mini\",\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", + " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1-mini\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 10000,\n", + " \"top_p\": 0.9,\n", + " \"tools\": [{\"type\": \"function\", **symbol_tool}],\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6ea31f2a", + "metadata": {}, + "source": [ + "### Utility Poller\n", + "\n", + "We create a utility poller that will be used to poll for the results of the eval runs." 
+  {
+   "cell_type": "markdown",
+   "id": "6ea31f2a",
+   "metadata": {},
+   "source": [
+    "### Utility Poller\n",
+    "\n",
+    "We define a small polling helper that checks both runs every few seconds until each one completes or fails."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb8f3df4",
+   "metadata": {},
+   "outputs": [
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "evalrun_68437e5370c481919a6874594ca177d9 queued ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 queued ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 in_progress ResultCounts(errored=0, failed=0, passed=0, total=0)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n",
-      "evalrun_68437e5370c481919a6874594ca177d9 completed ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
-      "evalrun_68437e544fe881918f76dbd8dce3fd15 completed ResultCounts(errored=0, failed=0, passed=1, total=1)\n"
-     ]
-    }
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">evalrun_6848e2269570819198b757fe12b979da completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "</pre>
\n" ], - "source": [ - "import os\n", - "import time\n", - "\n", - "import openai\n", - "\n", - "client = openai.OpenAI(\n", - " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", - ")\n", - "\n", - "\n", - "def get_dataset(limit=None):\n", - " openai_sdk_file_path = os.path.dirname(openai.__file__)\n", - "\n", - " file_paths = [\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n", - " os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n", - " ]\n", - "\n", - " items = []\n", - " for file_path in file_paths:\n", - " items.append({\"input\": open(file_path, \"r\").read()})\n", - " if limit:\n", - " return items[:limit]\n", - " return items\n", - "\n", - "\n", - "structured_output_grader = \"\"\"\n", - "You are a helpful assistant that grades the quality of extracted information from a code file.\n", - "You will be given a code file and a list of extracted information.\n", - "You should grade the quality of the extracted information.\n", - "\n", - "You should grade the quality on a scale of 1 to 7.\n", - "You should apply the following criteria, and calculate your score as follows:\n", - "You should first check for completeness on a scale of 1 to 7.\n", - "Then you should apply a quality modifier.\n", - "\n", - "The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n", - "If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n", - "If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n", - "etc.\n", - "\"\"\"\n", - "\n", - "structured_output_grader_user_prompt = \"\"\"\n", - "\n", - "{{item.input}}\n", - "\n", - "\n", - "\n", - "{{sample.output_tools[0].function.arguments.symbols}}\n", - "\n", - "\"\"\"\n", - "\n", - "logs_eval = client.evals.create(\n", - " name=\"Code QA Eval\",\n", - " data_source_config={\n", - " \"type\": \"custom\",\n", - " \"item_schema\": {\"type\": \"object\", \"properties\": {\"input\": {\"type\": \"string\"}}},\n", - " \"include_sample_schema\": True,\n", - " },\n", - " testing_criteria=[\n", - " {\n", - " \"type\": \"score_model\",\n", - " \"name\": \"General Evaluator\",\n", - " \"model\": \"o3\",\n", - " \"input\": [\n", - " {\"role\": \"system\", \"content\": structured_output_grader},\n", - " {\"role\": \"user\", \"content\": structured_output_grader_user_prompt},\n", - " ],\n", - " \"range\": [1, 7],\n", - " \"pass_threshold\": 5.5,\n", - " }\n", - " ],\n", - ")\n", - "\n", - "symbol_tool = {\n", - " \"name\": \"extract_symbols\",\n", - " \"description\": \"Extract the symbols from the code file\",\n", - " \"parameters\": {\n", - " \"type\": \"object\",\n", - " \"properties\": {\n", - " \"symbols\": {\n", - " \"type\": \"array\",\n", - " \"description\": \"A list of symbols extracted from Python code.\",\n", - " \"items\": {\n", - " \"type\": \"object\",\n", - " \"properties\": {\n", - " \"name\": {\"type\": \"string\", \"description\": \"The name of the symbol.\"},\n", - " \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n", - " },\n", - " \"required\": [\"name\", \"symbol_type\"],\n", - " \"additionalProperties\": False,\n", - " },\n", - " }\n", - " 
},\n", - " \"required\": [\"symbols\"],\n", - " \"additionalProperties\": False,\n", - " },\n", - "}\n", - "\n", - "gpt_4one_completions_run = client.evals.runs.create(\n", - " name=\"gpt-4.1\",\n", - " eval_id=logs_eval.id,\n", - " data_source={\n", - " \"type\": \"completions\",\n", - " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", - " \"input_messages\": {\n", - " \"type\": \"template\",\n", - " \"template\": [\n", - " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", - " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", - " ],\n", - " },\n", - " \"model\": \"gpt-4.1\",\n", - " \"sampling_params\": {\n", - " \"seed\": 42,\n", - " \"temperature\": 0.7,\n", - " \"max_completions_tokens\": 10000,\n", - " \"top_p\": 0.9,\n", - " \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n", - " },\n", - " },\n", - ")\n", - "\n", - "gpt_4one_responses_run = client.evals.runs.create(\n", - " name=\"gpt-4.1\",\n", - " eval_id=logs_eval.id,\n", - " data_source={\n", - " \"type\": \"responses\",\n", - " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", - " \"input_messages\": {\n", - " \"type\": \"template\",\n", - " \"template\": [\n", - " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", - " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", - " ],\n", - " },\n", - " \"model\": \"gpt-4.1\",\n", - " \"sampling_params\": {\n", - " \"seed\": 42,\n", - " \"temperature\": 0.7,\n", - " \"max_completions_tokens\": 10000,\n", - " \"top_p\": 0.9,\n", - " \"tools\": [{\"type\": \"function\", **symbol_tool}],\n", - " },\n", - " },\n", - ")\n", - "\n", - "\n", - "def poll_runs(eval_id, run_ids):\n", - " # poll both runs at the same time, until they are complete or failed\n", - " while True:\n", - " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", - " for run in runs:\n", - " print(run.id, run.status, run.result_counts)\n", - " if all(run.status in (\"completed\", \"failed\") for run in runs):\n", - " break\n", - " time.sleep(5)\n", - "\n", - "\n", - "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])\n" + "text/plain": [ + "evalrun_6848e2269570819198b757fe12b979da completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" ] + }, + "metadata": {}, + "output_type": "display_data" }, { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "completions_output = client.evals.runs.output_items.list(\n", - " run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n", - ")\n", - "\n", - "responses_output = client.evals.runs.output_items.list(\n", - " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", - ")\n" + "data": { + "text/html": [ + "
<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "</pre>
\n" + ], + "text/plain": [ + "evalrun_6848e227d3a481918a9b970c897b5998 completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" ] - }, + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def poll_runs(eval_id, run_ids):\n", + " # poll both runs at the same time, until they are complete or failed\n", + " while True:\n", + " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", + " for run in runs:\n", + " print(run.id, run.status, run.result_counts)\n", + " if all(run.status in (\"completed\", \"failed\") for run in runs):\n", + " break\n", + " time.sleep(5)\n", + "\n", + "\n", + "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4014cde", + "metadata": { + "tags": [ + "original" + ] + }, + "outputs": [], + "source": [ + "\n", + "### Get Output\n", + "completions_output = client.evals.runs.output_items.list(\n", + " run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n", + ")\n", + "\n", + "responses_output = client.evals.runs.output_items.list(\n", + " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "88ae7e17", + "metadata": {}, + "source": [ + "### Inspecting results\n", + "\n", + "For both completions and responses, we print the *symbols* dictionary that the model returned. You can diff this against the reference answer or compute precision / recall." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c0cddb6d", + "metadata": { + "tags": [ + "original" + ] + }, + "outputs": [ { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[{'name': 'Evals', 'symbol_type': 'class'}, {'name': 'AsyncEvals', 'symbol_type': 'class'}, {'name': 'EvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'EvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': '__all__', 'symbol_type': 'variable'}, {'name': 'Evals.runs', 'symbol_type': 'function'}, {'name': 'Evals.with_raw_response', 'symbol_type': 'function'}, {'name': 'Evals.with_streaming_response', 'symbol_type': 'function'}, {'name': 'Evals.create', 'symbol_type': 'function'}, {'name': 'Evals.retrieve', 'symbol_type': 'function'}, {'name': 'Evals.update', 'symbol_type': 'function'}, {'name': 'Evals.list', 'symbol_type': 'function'}, {'name': 'Evals.delete', 'symbol_type': 'function'}, {'name': 'AsyncEvals.runs', 'symbol_type': 'function'}, {'name': 'AsyncEvals.with_raw_response', 'symbol_type': 'function'}, {'name': 'AsyncEvals.with_streaming_response', 'symbol_type': 'function'}, {'name': 'AsyncEvals.create', 'symbol_type': 'function'}, {'name': 'AsyncEvals.retrieve', 'symbol_type': 'function'}, {'name': 'AsyncEvals.update', 'symbol_type': 'function'}, {'name': 'AsyncEvals.list', 'symbol_type': 'function'}, {'name': 'AsyncEvals.delete', 'symbol_type': 'function'}, {'name': 'EvalsWithRawResponse.__init__', 'symbol_type': 'function'}, {'name': 'EvalsWithRawResponse.runs', 'symbol_type': 'function'}, {'name': 
'AsyncEvalsWithRawResponse.__init__', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithRawResponse.runs', 'symbol_type': 'function'}, {'name': 'EvalsWithStreamingResponse.__init__', 'symbol_type': 'function'}, {'name': 'EvalsWithStreamingResponse.runs', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithStreamingResponse.__init__', 'symbol_type': 'function'}, {'name': 'AsyncEvalsWithStreamingResponse.runs', 'symbol_type': 'function'}]\n", - "[{'name': 'Evals', 'symbol_type': 'class'}, {'name': 'AsyncEvals', 'symbol_type': 'class'}, {'name': 'EvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithRawResponse', 'symbol_type': 'class'}, {'name': 'EvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': 'AsyncEvalsWithStreamingResponse', 'symbol_type': 'class'}, {'name': '__all__', 'symbol_type': 'variable'}]\n" - ] - } + "data": { + "text/html": [ + "\n", + "
\n", + "

\n", + " Completions vs Responses Output Symbols\n", + "

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Completions OutputResponses Output
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
namesymbol_type
Evalsclass
AsyncEvalsclass
EvalsWithRawResponseclass
AsyncEvalsWithRawResponseclass
EvalsWithStreamingResponseclass
AsyncEvalsWithStreamingResponseclass
__all__variable
\n", + "
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
namesymbol_type
Evalsclass
runsfunction
with_raw_responsefunction
with_streaming_responsefunction
createfunction
retrievefunction
updatefunction
listfunction
deletefunction
AsyncEvalsclass
runsfunction
with_raw_responsefunction
with_streaming_responsefunction
createfunction
retrievefunction
updatefunction
listfunction
deletefunction
EvalsWithRawResponseclass
__init__function
runsfunction
AsyncEvalsWithRawResponseclass
__init__function
runsfunction
EvalsWithStreamingResponseclass
__init__function
runsfunction
AsyncEvalsWithStreamingResponseclass
__init__function
runsfunction
\n", + "
\n", + "
\n" ], - "source": [ - "import json\n", - "\n", - "for item in completions_output:\n", - " print(json.loads(item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"])[\"symbols\"])\n", - "\n", - "for item in responses_output:\n", - " print(json.loads(item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"])[\"symbols\"])\n" + "text/plain": [ + "" ] + }, + "metadata": {}, + "output_type": "display_data" } - ], - "metadata": { - "kernelspec": { - "display_name": "openai", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.9" - } + ], + "source": [ + "import json\n", + "import pandas as pd\n", + "from IPython.display import display, HTML\n", + "\n", + "def extract_symbols(output_list):\n", + " symbols_list = []\n", + " for item in output_list:\n", + " try:\n", + " args = item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"]\n", + " symbols = json.loads(args)[\"symbols\"]\n", + " symbols_list.append(symbols)\n", + " except Exception as e:\n", + " symbols_list.append([{\"error\": str(e)}])\n", + " return symbols_list\n", + "\n", + "completions_symbols = extract_symbols(completions_output)\n", + "responses_symbols = extract_symbols(responses_output)\n", + "\n", + "def symbols_to_html_table(symbols):\n", + " if symbols and isinstance(symbols, list):\n", + " df = pd.DataFrame(symbols)\n", + " return (\n", + " df.style\n", + " .set_properties(**{\n", + " 'white-space': 'pre-wrap',\n", + " 'word-break': 'break-word',\n", + " 'padding': '2px 6px',\n", + " 'border': '1px solid #C3E7FA',\n", + " 'font-size': '0.92em',\n", + " 'background-color': '#FDFEFF'\n", + " })\n", + " .set_table_styles([{\n", + " 'selector': 'th',\n", + " 'props': [\n", + " ('font-size', '0.95em'),\n", + " ('background-color', '#1CA7EC'),\n", + " ('color', '#fff'),\n", + " ('border-bottom', '1px solid #18647E'),\n", + " ('padding', '2px 6px')\n", + " ]\n", + " }])\n", + " .hide(axis='index')\n", + " .to_html()\n", + " )\n", + " return f\"
{str(symbols)}
\"\n", + "\n", + "table_rows = []\n", + "max_len = max(len(completions_symbols), len(responses_symbols))\n", + "for i in range(max_len):\n", + " c_html = symbols_to_html_table(completions_symbols[i]) if i < len(completions_symbols) else \"\"\n", + " r_html = symbols_to_html_table(responses_symbols[i]) if i < len(responses_symbols) else \"\"\n", + " table_rows.append(f\"\"\"\n", + " \n", + " {c_html}\n", + " {r_html}\n", + " \n", + " \"\"\")\n", + "\n", + "table_html = f\"\"\"\n", + "
\n", + "

\n", + " Completions vs Responses Output Symbols\n", + "

\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " {''.join(table_rows)}\n", + " \n", + "
Completions OutputResponses Output
\n", + "
\n", + "\"\"\"\n", + "\n", + "display(HTML(table_html))\n" + ] + }, + { + "cell_type": "markdown", + "id": "e8e4ca5a", + "metadata": {}, + "source": [ + "### Visualize Evals Dashboard\n", + "\n", + "You can navigate to the Evals Dashboard in order to visualize the data.\n", + "\n", + "\n", + "![evals_tool_dashboard](../../../images/evals_tool_dashboard.png)\n", + "\n", + "\n", + "You can also take a look at the explanation of the failed results in the Evals Dashboard after the run is complete as shown in the image below.\n", + "\n", + "![evals_tool_failed](../../../images/eval_tools_fail.png)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "50ad84ad", + "metadata": {}, + "source": [ + "This notebook demonstrated how to use OpenAI Evals to assess and improve a model’s ability to extract structured information from Python code using tool calls. \n", + "\n", + "\n", + "OpenAI Evals provides a robust, reproducible framework for evaluating LLMs on structured extraction tasks. By combining clear tool schemas, rigorous grading rubrics, and well-structured datasets, you can measure and improve overall performance.\n", + "\n", + "*For more details, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).*" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 2 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/images/eval_tools_fail.png b/images/eval_tools_fail.png new file mode 100644 index 0000000000..b3e0ba49e7 Binary files /dev/null and b/images/eval_tools_fail.png differ diff --git a/images/evals_tool_dashboard.png b/images/evals_tool_dashboard.png new file mode 100644 index 0000000000..77c9338486 Binary files /dev/null and b/images/evals_tool_dashboard.png differ diff --git a/registry.yaml b/registry.yaml index 552b5fef48..c821bd077c 100644 --- a/registry.yaml +++ b/registry.yaml @@ -2158,6 +2158,7 @@ date: 2025-06-09 authors: - josiah-openai + - shikhar-cyber tags: - evals-api - responses