diff --git a/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb b/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb
index d255fe79aa..37c21f450a 100644
--- a/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb
+++ b/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb
@@ -1,11 +1,92 @@
{
"cells": [
+ {
+ "cell_type": "markdown",
+ "id": "0a2d56c0",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# Structured Output Evaluation Cookbook\n",
+ " \n",
+ "This notebook walks you through a set of focused, runnable examples how to use the OpenAI **Evals** framework to **test, grade, and iterate on tasks that require large‑language models to produce structured outputs**.\n",
+ "\n",
+ "> **Why does this matter?** \n",
+ "> Production systems often depend on JSON, SQL, or domain‑specific formats. Relying on spot checks or ad‑hoc prompt tweaks quickly breaks down. Instead, you can *codify* expectations as automated evals and let your team ship with safety bricks instead of sand.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "45eee293",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Quick Tour\n",
+ "\n",
+ "* **Section 1 – Prerequisites**: environment variables and package setup \n",
+ "* **Section 2 – Walk‑through: Code‑symbol extraction**: end‑to‑end demo that grades the model’s ability to extract function and class names from source code. We keep the original logic intact and simply layer documentation around it. \n",
+ "* **Section 3 – Additional Recipes**: sketches of common production patterns such as sentiment extraction as additional code sample for evaluation.\n",
+ "* **Section 4 – Result Exploration**: lightweight helpers for pulling run output and digging into failures. \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e027be46",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "1. **Install dependencies** (minimum versions shown):\n",
+ "\n",
+ "```bash\n",
+ "pip install --upgrade openai\n",
+ "```\n",
+ "\n",
+ "2. **Authenticate** by exporting your key:\n",
+ "\n",
+ "```bash\n",
+ "export OPENAI_API_KEY=\"sk‑...\"\n",
+ "```\n",
+ "\n",
+ "3. **Optional**: if you plan to run evals in bulk, set up an [organization‑level key](https://platform.openai.com/account/org-settings) with appropriate limits.\n"
+ ]
+ },
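+ {
+ "cell_type": "markdown",
+ "id": "3f9a1c2b",
+ "metadata": {},
+ "source": [
+ "Before running the cells below, you can optionally confirm that the key is visible to Python. A quick sanity check (nothing else in the notebook depends on this cell):\n",
+ "\n",
+ "```python\n",
+ "import os\n",
+ "\n",
+ "assert os.getenv(\"OPENAI_API_KEY\"), \"OPENAI_API_KEY is not set; export it before running this notebook\"\n",
+ "```\n"
+ ]
+ },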
+ {
+ "cell_type": "markdown",
+ "id": "4592675d",
+ "metadata": {},
+ "source": [
+ "### Use Case 1: Code symbol extraction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d2a32d53",
+ "metadata": {},
+ "source": [
+ "\n",
+ "The goal is to **extract all function, class, and constant symbols from python files inside the OpenAI SDK**. \n",
+ "For each file we ask the model to emit structured JSON like:\n",
+ "\n",
+ "```json\n",
+ "{\n",
+ " \"symbols\": [\n",
+ " {\"name\": \"OpenAI\", \"kind\": \"class\"},\n",
+ " {\"name\": \"Evals\", \"kind\": \"module\"},\n",
+ " ...\n",
+ " ]\n",
+ "}\n",
+ "```\n",
+ "\n",
+ "A rubric model then grades **completeness** (did we capture every symbol?) and **quality** (are the kinds correct?) on a 1‑7 scale.\n"
+ ]
+ },
{
"cell_type": "markdown",
"id": "9dd88e7c",
"metadata": {},
"source": [
- "# Evaluating Code Quality Extraction with a Custom Dataset"
+ "### Evaluating Code Quality Extraction with a Custom Dataset"
]
},
{
@@ -13,28 +94,65 @@
"id": "64bf0667",
"metadata": {},
"source": [
- "This notebook demonstrates how to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset."
+ "Let us walk though an example to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c95faa47",
+ "metadata": {},
+ "source": [
+ "### Initialize SDK client\n",
+ "Creates an `openai.OpenAI` client using the `OPENAI_API_KEY` we exported above. Nothing will run without this."
]
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 11,
"id": "eacc6ac7",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
"source": [
+ "%pip install --upgrade openai pandas rich --quiet\n",
+ "\n",
+ "\n",
+ "\n",
"import os\n",
"import time\n",
"import openai\n",
+ "from rich import print\n",
+ "import pandas as pd\n",
"\n",
"client = openai.OpenAI(\n",
" api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
")"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "8200aaf1",
+ "metadata": {},
+ "source": [
+ "### Dataset factory & grading rubric\n",
+ "* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n",
+ "* `structured_output_grader` defines a detailed evaluation rubric.\n",
+ "* `client.evals.create(...)` registers the eval with the platform."
+ ]
+ },
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 4,
"id": "b272e193",
"metadata": {},
"outputs": [],
@@ -110,13 +228,23 @@
")"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "4e77cbe6",
+ "metadata": {},
+ "source": [
+ "### Kick off model runs\n",
+ "Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint."
+ ]
+ },
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 5,
"id": "18f357e6",
"metadata": {},
"outputs": [],
"source": [
+ "### Kick off model runs\n",
"gpt_4one_completions_run = client.evals.runs.create(\n",
" name=\"gpt-4.1\",\n",
" eval_id=logs_eval.id,\n",
@@ -251,13 +379,54 @@
")"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "dd0aa0c0",
+ "metadata": {},
+ "source": [
+ "### Utility poller\n",
+ "Next, we will use a simple loop that waits for all runs to finish, then saves each run’s JSON to disk so you can inspect it later or attach it to CI artifacts."
+ ]
+ },
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
"id": "cbc4f775",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
+ "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+ "
\n"
+ ],
+ "text/plain": [
+ "evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
+ "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
+ "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+ "
\n"
+ ],
+ "text/plain": [
+ "evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
+ "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
"source": [
+ "### Utility poller\n",
"def poll_runs(eval_id, run_ids):\n",
" while True:\n",
" runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]\n",
@@ -278,9 +447,18 @@
"poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "77331859",
+ "metadata": {},
+ "source": [
+ "### Load outputs for quick inspection\n",
+ "We will fetch the output items for both runs so we can print or post‑process them."
+ ]
+ },
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": 8,
"id": "c316e6eb",
"metadata": {},
"outputs": [],
@@ -294,26 +472,370 @@
")"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "1cc61c54",
+ "metadata": {},
+ "source": [
+ "### Human-readable dump\n",
+ "Let us print a side-by-side view of completions vs responses."
+ ]
+ },
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 20,
"id": "9f1b502e",
"metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "Completions vs Responses Output\n",
+ "
\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ " \n",
+ " \n",
+ " Completions Output | \n",
+ " Responses Output | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " {\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvals\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithStreamingResponse\",\"symb... | \n",
+ " {\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"runs\",\"symbol_type\":\"property\"},{\"name\":\"with_raw_response\",\"symbol_type\":\"property\"},{\"name\":\"with_streaming_response\",\"symbol_type\":\"property\"},{\"name\":\"create\",\"symbol_type\":\"function\"},{... | \n",
+ "
\n",
+ " \n",
+ "
\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from IPython.display import display, HTML\n",
+ "\n",
+ "# Collect outputs for both runs\n",
+ "completions_outputs = [item.sample.output[0].content for item in completions_output]\n",
+ "responses_outputs = [item.sample.output[0].content for item in responses_output]\n",
+ "\n",
+ "# Create DataFrame for side-by-side display (truncated to 250 chars for readability)\n",
+ "df = pd.DataFrame({\n",
+ " \"Completions Output\": [c[:250].replace('\\n', ' ') + ('...' if len(c) > 250 else '') for c in completions_outputs],\n",
+ " \"Responses Output\": [r[:250].replace('\\n', ' ') + ('...' if len(r) > 250 else '') for r in responses_outputs]\n",
+ "})\n",
+ "\n",
+ "# Custom color scheme\n",
+ "custom_styles = [\n",
+ " {'selector': 'th', 'props': [('font-size', '1.1em'), ('background-color', '#323C50'), ('color', '#FFFFFF'), ('border-bottom', '2px solid #1CA7EC')]},\n",
+ " {'selector': 'td', 'props': [('font-size', '1em'), ('max-width', '650px'), ('background-color', '#F6F8FA'), ('color', '#222'), ('border-bottom', '1px solid #DDD')]},\n",
+ " {'selector': 'tr:hover td', 'props': [('background-color', '#D1ECF1'), ('color', '#18647E')]},\n",
+ " {'selector': 'tbody tr:nth-child(even) td', 'props': [('background-color', '#E8F1FB')]},\n",
+ " {'selector': 'tbody tr:nth-child(odd) td', 'props': [('background-color', '#F6F8FA')]},\n",
+ " {'selector': 'table', 'props': [('border-collapse', 'collapse'), ('border-radius', '6px'), ('overflow', 'hidden')]},\n",
+ "]\n",
+ "\n",
+ "styled = (\n",
+ " df.style\n",
+ " .set_properties(**{'white-space': 'pre-wrap', 'word-break': 'break-word', 'padding': '8px'})\n",
+ " .set_table_styles(custom_styles)\n",
+ " .hide(axis=\"index\")\n",
+ ")\n",
+ "\n",
+ "display(HTML(\"\"\"\n",
+ "\n",
+ "Completions vs Responses Output\n",
+ "
\n",
+ "\"\"\"))\n",
+ "display(styled)"
+ ]
+ },
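+ {
+ "cell_type": "markdown",
+ "id": "7d4e8a90",
+ "metadata": {},
+ "source": [
+ "### Digging into failures\n",
+ "\n",
+ "Beyond eyeballing the raw JSON, each output item also carries the grader's per-criterion results. A minimal sketch for surfacing them (this assumes the Evals API output-item shape with a `results` list; depending on SDK version the entries may be plain dicts or attribute-style objects, so both cases are handled):\n",
+ "\n",
+ "```python\n",
+ "# Sketch: print the grader verdict attached to each completions-run output item\n",
+ "for item in completions_output:\n",
+ "    for result in item.results or []:\n",
+ "        # each result is expected to carry the grader name, numeric score, and pass/fail flag\n",
+ "        if isinstance(result, dict):\n",
+ "            print(item.id, result.get(\"name\"), result.get(\"score\"), result.get(\"passed\"))\n",
+ "        else:\n",
+ "            print(item.id, getattr(result, \"name\", None), getattr(result, \"score\", None), getattr(result, \"passed\", None))\n",
+ "```\n"
+ ]
+ },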
+ {
+ "cell_type": "markdown",
+ "id": "8cbe934f",
+ "metadata": {},
+ "source": [
+ "### Visualize the Results\n",
+ "\n",
+ "Below are visualizations that represent the evaluation data and code outputs for structured QA evaluation. These images provide insights into the data distribution and the evaluation workflow.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "**Evaluation Data Overview**\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "---\n",
+ "\n",
+ "**Evaluation Code Workflow**\n",
+ "\n",
+ "\n",
+ "\n",
+ "---\n",
+ "\n",
+ "By reviewing these visualizations, you can better understand the structure of the evaluation dataset and the steps involved in evaluating structured outputs for QA tasks.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a0ae89ef",
+ "metadata": {},
+ "source": [
+ "### Use Case 2: Multi-lingual Sentiment Extraction\n",
+ "In a similar way, let us evaluate a multi-lingual sentiment extraction model with structured outputs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "e5f0b782",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sample in-memory dataset for sentiment extraction\n",
+ "sentiment_dataset = [\n",
+ " {\n",
+ " \"text\": \"I love this product!\",\n",
+ " \"channel\": \"twitter\",\n",
+ " \"language\": \"en\"\n",
+ " },\n",
+ " {\n",
+ " \"text\": \"This is the worst experience I've ever had.\",\n",
+ " \"channel\": \"support_ticket\",\n",
+ " \"language\": \"en\"\n",
+ " },\n",
+ " {\n",
+ " \"text\": \"It's okay – not great but not bad either.\",\n",
+ " \"channel\": \"app_review\",\n",
+ " \"language\": \"en\"\n",
+ " },\n",
+ " {\n",
+ " \"text\": \"No estoy seguro de lo que pienso sobre este producto.\",\n",
+ " \"channel\": \"facebook\",\n",
+ " \"language\": \"es\"\n",
+ " },\n",
+ " {\n",
+ " \"text\": \"总体来说,我对这款产品很满意。\",\n",
+ " \"channel\": \"wechat\",\n",
+ " \"language\": \"zh\"\n",
+ " },\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "cb6954f4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Define output schema\n",
+ "sentiment_output_schema = {\n",
+ " \"type\": \"object\",\n",
+ " \"properties\": {\n",
+ " \"sentiment\": {\n",
+ " \"type\": \"string\",\n",
+ " \"description\": \"overall label: positive / negative / neutral\"\n",
+ " },\n",
+ " \"confidence\": {\n",
+ " \"type\": \"number\",\n",
+ " \"description\": \"confidence score 0-1\"\n",
+ " },\n",
+ " \"emotions\": {\n",
+ " \"type\": \"array\",\n",
+ " \"description\": \"list of dominant emotions (e.g. joy, anger)\",\n",
+ " \"items\": {\"type\": \"string\"}\n",
+ " }\n",
+ " },\n",
+ " \"required\": [\"sentiment\", \"confidence\", \"emotions\"],\n",
+ " \"additionalProperties\": False\n",
+ "}\n",
+ "\n",
+ "# Grader prompts\n",
+ "sentiment_grader_system = \"\"\"You are a strict grader for sentiment extraction.\n",
+ "Given the text and the model's JSON output, score correctness on a 1-5 scale.\"\"\"\n",
+ "\n",
+ "sentiment_grader_user = \"\"\"Text: {{item.text}}\n",
+ "Model output:\n",
+ "{{sample.output_json}}\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "ac815aec",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Register an eval for the richer sentiment task\n",
+ "sentiment_eval = client.evals.create(\n",
+ " name=\"sentiment_extraction_eval\",\n",
+ " data_source_config={\n",
+ " \"type\": \"custom\",\n",
+ " \"item_schema\": { # matches the new dataset fields\n",
+ " \"type\": \"object\",\n",
+ " \"properties\": {\n",
+ " \"text\": {\"type\": \"string\"},\n",
+ " \"channel\": {\"type\": \"string\"},\n",
+ " \"language\": {\"type\": \"string\"},\n",
+ " },\n",
+ " \"required\": [\"text\"],\n",
+ " },\n",
+ " \"include_sample_schema\": True,\n",
+ " },\n",
+ " testing_criteria=[\n",
+ " {\n",
+ " \"type\": \"score_model\",\n",
+ " \"name\": \"Sentiment Grader\",\n",
+ " \"model\": \"o3\",\n",
+ " \"input\": [\n",
+ " {\"role\": \"system\", \"content\": sentiment_grader_system},\n",
+ " {\"role\": \"user\", \"content\": sentiment_grader_user},\n",
+ " ],\n",
+ " \"range\": [1, 5],\n",
+ " \"pass_threshold\": 3.5,\n",
+ " }\n",
+ " ],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2f4aa9d6",
+ "metadata": {},
"outputs": [],
"source": [
- "print('# Completions Output')\n",
- "for item in completions_output:\n",
- " print(item)\n",
+ "# Run the sentiment eval\n",
+ "sentiment_run = client.evals.runs.create(\n",
+ " name=\"gpt-4.1-sentiment\",\n",
+ " eval_id=sentiment_eval.id,\n",
+ " data_source={\n",
+ " \"type\": \"responses\",\n",
+ " \"source\": {\n",
+ " \"type\": \"file_content\",\n",
+ " \"content\": [{\"item\": item} for item in sentiment_dataset],\n",
+ " },\n",
+ " \"input_messages\": {\n",
+ " \"type\": \"template\",\n",
+ " \"template\": [\n",
+ " {\n",
+ " \"type\": \"message\",\n",
+ " \"role\": \"system\",\n",
+ " \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"},\n",
+ " },\n",
+ " {\n",
+ " \"type\": \"message\",\n",
+ " \"role\": \"user\",\n",
+ " \"content\": {\n",
+ " \"type\": \"input_text\",\n",
+ " \"text\": \"{{item.text}}\",\n",
+ " },\n",
+ " },\n",
+ " ],\n",
+ " },\n",
+ " \"model\": \"gpt-4.1\",\n",
+ " \"sampling_params\": {\n",
+ " \"seed\": 42,\n",
+ " \"temperature\": 0.7,\n",
+ " \"max_completions_tokens\": 100,\n",
+ " \"top_p\": 0.9,\n",
+ " \"text\": {\n",
+ " \"format\": {\n",
+ " \"type\": \"json_schema\",\n",
+ " \"name\": \"sentiment_output\",\n",
+ " \"schema\": sentiment_output_schema,\n",
+ " \"strict\": True,\n",
+ " },\n",
+ " },\n",
+ " },\n",
+ " },\n",
+ ")"
+ ]
+ },
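+ {
+ "cell_type": "markdown",
+ "id": "5b2c9f1e",
+ "metadata": {},
+ "source": [
+ "The sentiment run can be monitored and inspected the same way as the code-symbol runs above. A short sketch that simply reuses the `poll_runs` helper and output listing from earlier in the notebook:\n",
+ "\n",
+ "```python\n",
+ "# Wait for the sentiment run to finish, then pull and print its structured outputs\n",
+ "poll_runs(sentiment_eval.id, [sentiment_run.id])\n",
+ "\n",
+ "sentiment_output = client.evals.runs.output_items.list(\n",
+ "    run_id=sentiment_run.id, eval_id=sentiment_eval.id\n",
+ ")\n",
+ "\n",
+ "for item in sentiment_output:\n",
+ "    print(item.sample.output[0].content)\n",
+ "```\n"
+ ]
+ },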
+ {
+ "cell_type": "markdown",
+ "id": "17f5f960",
+ "metadata": {},
+ "source": [
+ "### Visualize evals data \n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ab141018",
+ "metadata": {},
+ "source": [
+ "### Summary and Next Steps\n",
+ "\n",
+ "In this notebook, we have demonstrated how to use the OpenAI Evaluation API to evaluate a model's performance on a structured output task. \n",
+ "\n",
+ "**Next steps:**\n",
+ "- We encourage you to try out the API with your own models and datasets.\n",
+ "- You can also explore the API documentation for more details on how to use the API. \n",
"\n",
- "print('\\n# Responses Output')\n",
- "for item in responses_output:\n",
- " print(item)"
+ "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).\n"
]
}
],
"metadata": {
"kernelspec": {
- "display_name": "openai",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -327,7 +849,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.12.9"
+ "version": "3.11.8"
}
},
"nbformat": 4,
diff --git a/examples/evaluation/use-cases/web-search-evaluation.ipynb b/examples/evaluation/use-cases/web-search-evaluation.ipynb
index 91f9dbb5f3..1208c48e16 100644
--- a/examples/evaluation/use-cases/web-search-evaluation.ipynb
+++ b/examples/evaluation/use-cases/web-search-evaluation.ipynb
@@ -11,7 +11,39 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset."
+ "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset.\n",
+ "\n",
+ "**Goals:**\n",
+ "- Show how to set up and run an evaluation for web search quality.\n",
+ "- Provide a template for evaluating information retrieval capabilities of LLMs.\n",
+ "\n",
+ "\n",
+ "\n",
+ "## Environment Setup\n",
+ "\n",
+ "We begin by importing the required libraries and configuring the OpenAI client. \n",
+ "This ensures we have access to the OpenAI API and all necessary utilities for evaluation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Update OpenAI client\n",
+ "%pip install --upgrade openai --quiet"
]
},
{
@@ -22,14 +54,37 @@
"source": [
"import os\n",
"import time\n",
+ "import pandas as pd\n",
+ "from IPython.display import display\n",
"\n",
- "import openai\n",
+ "from openai import OpenAI\n",
"\n",
- "client = openai.OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"))\n",
+ "client = OpenAI(\n",
+ " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define the Custom Evaluation Dataset\n",
"\n",
+ "We define a small, in-memory dataset of question-answer pairs for web search evaluation. \n",
+ "Each item contains a `query` (the user's search prompt) and an `answer` (the expected ground truth).\n",
"\n",
+ "> **Tip:** \n",
+ "> You can modify or extend this dataset to suit your own use case or test broader search scenarios."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
"def get_dataset(limit=None):\n",
- " return [\n",
+ " dataset = [\n",
" {\n",
" \"query\": \"coolest person in the world, the 100m dash at the 2008 olympics was the best sports event of all time\",\n",
" \"answer\": \"usain bolt\",\n",
@@ -42,9 +97,59 @@
" \"query\": \"most fun place to visit, I am obsessed with the Philbrook Museum of Art\",\n",
" \"answer\": \"tulsa, oklahoma\",\n",
" },\n",
+ " {\n",
+ " \"query\": \"who created the python programming language, beloved by data scientists everywhere\",\n",
+ " \"answer\": \"guido van rossum\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"greatest chess player in history, famous for the 1972 world championship\",\n",
+ " \"answer\": \"bobby fischer\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"the city of lights, home to the eiffel tower and louvre museum\",\n",
+ " \"answer\": \"paris\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"most popular search engine, whose name is now a verb\",\n",
+ " \"answer\": \"google\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"the first man to walk on the moon, giant leap for mankind\",\n",
+ " \"answer\": \"neil armstrong\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"groundbreaking electric car company founded by elon musk\",\n",
+ " \"answer\": \"tesla\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"founder of microsoft, philanthropist and software pioneer\",\n",
+ " \"answer\": \"bill gates\",\n",
+ " },\n",
" ]\n",
+ " return dataset[:limit] if limit else dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define Grading Logic\n",
+ "\n",
+ "To evaluate the model’s answers, we use an LLM-based pass/fail grader:\n",
"\n",
+ "- **Pass/Fail Grader:** \n",
+ " An LLM-based grader that checks if the model’s answer (from web search) matches the expected answer (ground truth) or contains the correct information.\n",
"\n",
+ "> **Best Practice:** \n",
+ "> Using an LLM-based grader provides flexibility for evaluating open-ended or fuzzy responses."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
"pass_fail_grader = \"\"\"\n",
"You are a helpful assistant that grades the quality of a web search.\n",
"You will be given a query and an answer.\n",
@@ -66,10 +171,36 @@
"\n",
"{{item.answer}}\n",
"\n",
- "\"\"\"\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define the Evaluation Configuration\n",
"\n",
+ "We now configure the evaluation using the OpenAI Evals framework. \n",
+ "\n",
+ "This step specifies:\n",
+ "- The evaluation name and dataset.\n",
+ "- The schema for each item (what fields are present in each Q&A pair).\n",
+ "- The grader(s) to use (LLM-based pass/fail).\n",
+ "- The passing criteria and labels.\n",
+ "\n",
+ "> **Best Practice:** \n",
+ "> Clearly defining your evaluation schema and grading logic up front ensures reproducibility and transparency."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create the evaluation definition using the OpenAI Evals client.\n",
"logs_eval = client.evals.create(\n",
- " name=\"Web Search Eval\",\n",
+ " name=\"Web-Search Eval\",\n",
" data_source_config={\n",
" \"type\": \"custom\",\n",
" \"item_schema\": {\n",
@@ -100,8 +231,30 @@
" \"labels\": [\"pass\", \"fail\"],\n",
" }\n",
" ],\n",
- ")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Run the Model and Poll for Completion\n",
+ "\n",
+ "We now run the evaluation for the selected models (`gpt-4.1` and `gpt-4.1-mini`). \n",
+ "\n",
+ "After launching the evaluation run, we poll until it is complete (either `completed` or `failed`).\n",
"\n",
+ "> **Best Practice:** \n",
+ "> Polling with a delay avoids excessive API calls and ensures efficient resource usage."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Launch the evaluation run for gpt-4.1 using web search\n",
"gpt_4one_responses_run = client.evals.runs.create(\n",
" name=\"gpt-4.1\",\n",
" eval_id=logs_eval.id,\n",
@@ -141,41 +294,272 @@
" \"tools\": [{\"type\": \"web_search_preview\"}],\n",
" },\n",
" },\n",
- ")\n",
- "\n",
- "\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Launch the evaluation run for gpt-4.1-mini using web search\n",
+ "gpt_4one_mini_responses_run = client.evals.runs.create(\n",
+ " name=\"gpt-4.1-mini\",\n",
+ " eval_id=logs_eval.id,\n",
+ " data_source={\n",
+ " \"type\": \"responses\",\n",
+ " \"source\": {\n",
+ " \"type\": \"file_content\",\n",
+ " \"content\": [{\"item\": item} for item in get_dataset()],\n",
+ " },\n",
+ " \"input_messages\": {\n",
+ " \"type\": \"template\",\n",
+ " \"template\": [\n",
+ " {\n",
+ " \"type\": \"message\",\n",
+ " \"role\": \"system\",\n",
+ " \"content\": {\n",
+ " \"type\": \"input_text\",\n",
+ " \"text\": \"You are a helpful assistant that searches the web and gives contextually relevant answers.\",\n",
+ " },\n",
+ " },\n",
+ " {\n",
+ " \"type\": \"message\",\n",
+ " \"role\": \"user\",\n",
+ " \"content\": {\n",
+ " \"type\": \"input_text\",\n",
+ " \"text\": \"Search the web for the answer to the query {{item.query}}\",\n",
+ " },\n",
+ " },\n",
+ " ],\n",
+ " },\n",
+ " \"model\": \"gpt-4.1-mini\",\n",
+ " \"sampling_params\": {\n",
+ " \"seed\": 42,\n",
+ " \"temperature\": 0.7,\n",
+ " \"max_completions_tokens\": 10000,\n",
+ " \"top_p\": 0.9,\n",
+ " \"tools\": [{\"type\": \"web_search_preview\"}],\n",
+ " },\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "evalrun_68477e0f56a481919eea5e7d8a04225e completed ResultCounts(errored=0, failed=1, passed=9, total=10)\n",
+ "evalrun_68477e712bb48191bc7368b084f8c52c completed ResultCounts(errored=0, failed=0, passed=10, total=10)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# poll both runs at the same time, until they are complete or failed\n",
"def poll_runs(eval_id, run_ids):\n",
- " # poll both runs at the same time, until they are complete or failed\n",
" while True:\n",
" runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n",
" for run in runs:\n",
" print(run.id, run.status, run.result_counts)\n",
- " if all(run.status == \"completed\" or run.status == \"failed\" for run in runs):\n",
+ " if all(run.status in {\"completed\", \"failed\"} for run in runs):\n",
" break\n",
" time.sleep(5)\n",
"\n",
+ "# Start polling the run until completion\n",
+ "poll_runs(logs_eval.id, [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id])\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Display and Interpret Model Outputs\n",
"\n",
- "poll_runs(logs_eval.id, [gpt_4one_responses_run.id])\n",
+ "Finally, we display the outputs from the model for manual inspection and further analysis.\n",
"\n",
+ "- Each answer is printed for each query in the dataset.\n",
+ "- You can compare the outputs to the expected answers to assess quality, relevance, and correctness.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " GPT-4.1 Output | \n",
+ " GPT-4.1-mini Output | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " If you're captivated by the Philbrook Museum o... | \n",
+ " Bobby Fischer is widely regarded as one of the... | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " \\n## [Paris, France](https://www.google.com/ma... | \n",
+ " The 2008 Olympic 100m dash is widely regarded ... | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " Bill Gates, born on October 28, 1955, in Seatt... | \n",
+ " If you're looking for fun places to visit in T... | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " Usain Bolt's performance in the 100-meter fina... | \n",
+ " On July 20, 1969, astronaut Neil Armstrong bec... | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " It seems you're interested in both the world's... | \n",
+ " Bill Gates is a renowned software pioneer, phi... | \n",
+ "
\n",
+ " \n",
+ " 5 | \n",
+ " Neil Armstrong was the first person to walk on... | \n",
+ " Your statement, \"there is nothing better than ... | \n",
+ "
\n",
+ " \n",
+ " 6 | \n",
+ " Tesla, Inc. is an American electric vehicle an... | \n",
+ " The search engine whose name has become synony... | \n",
+ "
\n",
+ " \n",
+ " 7 | \n",
+ " Bobby Fischer, widely regarded as one of the g... | \n",
+ " \\n## [Paris, France](https://www.google.com/ma... | \n",
+ "
\n",
+ " \n",
+ " 8 | \n",
+ " Guido van Rossum, a Dutch programmer born on J... | \n",
+ " Guido van Rossum, a Dutch programmer born on J... | \n",
+ "
\n",
+ " \n",
+ " 9 | \n",
+ " The most popular search engine whose name has ... | \n",
+ " Elon Musk is the CEO and largest shareholder o... | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " GPT-4.1 Output \\\n",
+ "0 If you're captivated by the Philbrook Museum o... \n",
+ "1 \\n## [Paris, France](https://www.google.com/ma... \n",
+ "2 Bill Gates, born on October 28, 1955, in Seatt... \n",
+ "3 Usain Bolt's performance in the 100-meter fina... \n",
+ "4 It seems you're interested in both the world's... \n",
+ "5 Neil Armstrong was the first person to walk on... \n",
+ "6 Tesla, Inc. is an American electric vehicle an... \n",
+ "7 Bobby Fischer, widely regarded as one of the g... \n",
+ "8 Guido van Rossum, a Dutch programmer born on J... \n",
+ "9 The most popular search engine whose name has ... \n",
+ "\n",
+ " GPT-4.1-mini Output \n",
+ "0 Bobby Fischer is widely regarded as one of the... \n",
+ "1 The 2008 Olympic 100m dash is widely regarded ... \n",
+ "2 If you're looking for fun places to visit in T... \n",
+ "3 On July 20, 1969, astronaut Neil Armstrong bec... \n",
+ "4 Bill Gates is a renowned software pioneer, phi... \n",
+ "5 Your statement, \"there is nothing better than ... \n",
+ "6 The search engine whose name has become synony... \n",
+ "7 \\n## [Paris, France](https://www.google.com/ma... \n",
+ "8 Guido van Rossum, a Dutch programmer born on J... \n",
+ "9 Elon Musk is the CEO and largest shareholder o... "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Retrieve output items for the 4.1 model after completion\n",
"four_one = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
- ")"
+ ")\n",
+ "\n",
+ "# Retrieve output items for the 4.1-mini model after completion\n",
+ "four_one_mini = client.evals.runs.output_items.list(\n",
+ " run_id=gpt_4one_mini_responses_run.id, eval_id=logs_eval.id\n",
+ ")\n",
+ "\n",
+ "# Collect outputs for both models\n",
+ "four_one_outputs = [item.sample.output[0].content for item in four_one]\n",
+ "four_one_mini_outputs = [item.sample.output[0].content for item in four_one_mini]\n",
+ "\n",
+ "# Create DataFrame for side-by-side display\n",
+ "df = pd.DataFrame({\n",
+ " \"GPT-4.1 Output\": four_one_outputs,\n",
+ " \"GPT-4.1-mini Output\": four_one_mini_outputs\n",
+ "})\n",
+ "\n",
+ "display(df)"
]
},
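+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As noted above, you can compare each output against its expected answer. Here is a rough programmatic spot-check; it assumes each output item carries its originating `datasource_item` (so the ground-truth `answer` can be recovered) and only looks for a case-insensitive substring match, so the LLM grader above remains the authoritative judgment:\n",
+ "\n",
+ "```python\n",
+ "# Rough spot-check: does each GPT-4.1 answer mention the expected ground-truth string?\n",
+ "rows = []\n",
+ "for item in four_one:\n",
+ "    expected = item.datasource_item[\"answer\"]\n",
+ "    answer = item.sample.output[0].content\n",
+ "    rows.append({\"Expected answer\": expected, \"Mentions expected\": expected.lower() in answer.lower()})\n",
+ "\n",
+ "display(pd.DataFrame(rows))\n",
+ "```\n"
+ ]
+ },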
{
- "cell_type": "code",
- "execution_count": null,
+ "cell_type": "markdown",
"metadata": {},
- "outputs": [],
"source": [
- "for item in four_one:\n",
- " print(item.sample.output[0].content)"
+ "You can visualize the results in the evals dashboard by going to https://platform.openai.com/evaluations as shown in the image below:\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this notebook, we demonstrated a workflow for evaluating the web search capabilities of language models using the OpenAI Evals framework.\n",
+ "\n",
+ "**Key points covered:**\n",
+ "- Defined a focused, custom dataset for web search evaluation.\n",
+ "- Configured an LLM-based grader for robust assessment.\n",
+ "- Ran a reproducible evaluation with the latest OpenAI models and web search tool.\n",
+ "- Retrieved and displayed model outputs for inspection.\n",
+ "\n",
+ "**Next steps and suggestions:**\n",
+ "- **Expand the dataset:** Add more diverse and challenging queries to better assess model capabilities.\n",
+ "- **Analyze results:** Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses.\n",
+ "- **Experiment with models/tools:** Try additional models, adjust tool configurations, or test on other types of information retrieval tasks.\n",
+ "- **Automate reporting:** Generate summary tables or plots for easier sharing and decision-making.\n",
+ "\n",
+ "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals)."
]
}
],
"metadata": {
"kernelspec": {
- "display_name": "openai",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -189,7 +573,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.12.9"
+ "version": "3.11.8"
}
},
"nbformat": 4,
diff --git a/images/eval_qa_code.png b/images/eval_qa_code.png
new file mode 100644
index 0000000000..623de97022
Binary files /dev/null and b/images/eval_qa_code.png differ
diff --git a/images/eval_qa_data_1.png b/images/eval_qa_data_1.png
new file mode 100644
index 0000000000..d9a8cec284
Binary files /dev/null and b/images/eval_qa_data_1.png differ
diff --git a/images/eval_qa_data_2.png b/images/eval_qa_data_2.png
new file mode 100644
index 0000000000..6e2895cb14
Binary files /dev/null and b/images/eval_qa_data_2.png differ
diff --git a/images/evals_sentiment.png b/images/evals_sentiment.png
new file mode 100644
index 0000000000..5ebe8f3072
Binary files /dev/null and b/images/evals_sentiment.png differ
diff --git a/images/evals_websearch_dashboard.png b/images/evals_websearch_dashboard.png
new file mode 100644
index 0000000000..ae34fc4c6a
Binary files /dev/null and b/images/evals_websearch_dashboard.png differ
diff --git a/registry.yaml b/registry.yaml
index ac98ad8cc7..552b5fef48 100644
--- a/registry.yaml
+++ b/registry.yaml
@@ -2147,6 +2147,7 @@
date: 2025-06-09
authors:
- josiah-openai
+ - shikhar-cyber
tags:
- evals-api
- responses
@@ -2167,6 +2168,7 @@
date: 2025-06-09
authors:
- josiah-openai
+ - shikhar-cyber
tags:
- evals-api
- responses