diff --git a/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb b/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb index d255fe79aa..37c21f450a 100644 --- a/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb +++ b/examples/evaluation/use-cases/structured-outputs-evaluation.ipynb @@ -1,11 +1,92 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "0a2d56c0", + "metadata": {}, + "source": [ + "\n", + "# Structured Output Evaluation Cookbook\n", + " \n", + "This notebook walks you through a set of focused, runnable examples of how to use the OpenAI **Evals** framework to **test, grade, and iterate on tasks that require large language models to produce structured outputs**.\n", + "\n", + "> **Why does this matter?** \n", + "> Production systems often depend on JSON, SQL, or domain‑specific formats. Relying on spot checks or ad‑hoc prompt tweaks quickly breaks down. Instead, you can *codify* expectations as automated evals and let your team ship on a foundation of bricks instead of sand.\n" + ] + }, + { + "cell_type": "markdown", + "id": "45eee293", + "metadata": {}, + "source": [ + "\n", + "## Quick Tour\n", + "\n", + "* **Section 1 – Prerequisites**: environment variables and package setup \n", + "* **Section 2 – Walk‑through: Code‑symbol extraction**: end‑to‑end demo that grades the model’s ability to extract function and class names from source code. We keep the original logic intact and simply layer documentation around it. \n", + "* **Section 3 – Additional Recipes**: sketches of common production patterns, such as sentiment extraction, as additional code samples for evaluation.\n", + "* **Section 4 – Result Exploration**: lightweight helpers for pulling run output and digging into failures. \n" + ] + }, + { + "cell_type": "markdown", + "id": "e027be46", + "metadata": {}, + "source": [ + "\n", + "## Prerequisites\n", + "\n", + "1. **Install dependencies**:\n", + "\n", + "```bash\n", + "pip install --upgrade openai\n", + "```\n", + "\n", + "2. **Authenticate** by exporting your key:\n", + "\n", + "```bash\n", + "export OPENAI_API_KEY=\"sk-...\"\n", + "```\n", + "\n", + "3. **Optional**: if you plan to run evals in bulk, set up an [organization‑level key](https://platform.openai.com/account/org-settings) with appropriate limits.\n" + ] + }, + { + "cell_type": "markdown", + "id": "4592675d", + "metadata": {}, + "source": [ + "### Use Case 1: Code symbol extraction" + ] + }, + { + "cell_type": "markdown", + "id": "d2a32d53", + "metadata": {}, + "source": [ + "\n", + "The goal is to **extract all function, class, and constant symbols from Python files inside the OpenAI SDK**. \n", + "For each file we ask the model to emit structured JSON like:\n", + "\n", + "```json\n", + "{\n", + " \"symbols\": [\n", + " {\"name\": \"OpenAI\", \"kind\": \"class\"},\n", + " {\"name\": \"Evals\", \"kind\": \"module\"},\n", + " ...\n", + " ]\n", + "}\n", + "```\n", + "\n", + "A rubric model then grades **completeness** (did we capture every symbol?) and **quality** (are the kinds correct?) on a 1‑7 scale.\n" + ] + }, { "cell_type": "markdown", "id": "9dd88e7c", "metadata": {}, "source": [ - "# Evaluating Code Quality Extraction with a Custom Dataset" + "### Evaluating Code Quality Extraction with a Custom Dataset" ] }, { @@ -13,28 +94,65 @@ "id": "64bf0667", "metadata": {}, "source": [ - "This notebook demonstrates how to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset."
+ "Let us walk through an example to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset." + ] + }, + { + "cell_type": "markdown", + "id": "c95faa47", + "metadata": {}, + "source": [ + "### Initialize SDK client\n", + "Creates an `openai.OpenAI` client using the `OPENAI_API_KEY` we exported above. Nothing will run without this." ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 11, "id": "eacc6ac7", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ + "%pip install --upgrade openai pandas rich --quiet\n", + "\n", + "\n", + "\n", "import os\n", "import time\n", "import openai\n", + "from rich import print\n", + "import pandas as pd\n", "\n", "client = openai.OpenAI(\n", " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", ")" ] }, + { + "cell_type": "markdown", + "id": "8200aaf1", + "metadata": {}, + "source": [ + "### Dataset factory & grading rubric\n", + "* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n", + "* `structured_output_grader` defines a detailed evaluation rubric.\n", + "* `client.evals.create(...)` registers the eval with the platform." + ] + }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 4, "id": "b272e193", "metadata": {}, "outputs": [], @@ -110,13 +228,23 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "4e77cbe6", + "metadata": {}, + "source": [ + "### Kick off model runs\n", + "Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint." + ] + }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 5, "id": "18f357e6", "metadata": {}, "outputs": [], "source": [ + "### Kick off model runs\n", "gpt_4one_completions_run = client.evals.runs.create(\n", " name=\"gpt-4.1\",\n", " eval_id=logs_eval.id,\n", @@ -251,13 +379,54 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "dd0aa0c0", + "metadata": {}, + "source": [ + "### Utility poller\n", + "Next, we will use a simple loop that waits for all runs to finish, then saves each run’s JSON to disk so you can inspect it later or attach it to CI artifacts." + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "cbc4f775", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "<pre style=\"white-space:pre-wrap;word-wrap:break-word\">
evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "
\n" + ], + "text/plain": [ + "evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
+       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
+       "
\n" + ], + "text/plain": [ + "evalrun_68487dcdaba0819182db010fe5331f2e completed\n", + "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ + "### Utility poller\n", "def poll_runs(eval_id, run_ids):\n", " while True:\n", " runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]\n", @@ -278,9 +447,18 @@ "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])" ] }, + { + "cell_type": "markdown", + "id": "77331859", + "metadata": {}, + "source": [ + "### Load outputs for quick inspection\n", + "We will fetch the output items for both runs so we can print or post‑process them." + ] + }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 8, "id": "c316e6eb", "metadata": {}, "outputs": [], @@ -294,26 +472,370 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "1cc61c54", + "metadata": {}, + "source": [ + "### Human-readable dump\n", + "Let us print a side-by-side view of completions vs responses." + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "id": "9f1b502e", "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "

\n", + "Completions vs Responses Output\n", + "

\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Completions OutputResponses Output
{\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvals\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithStreamingResponse\",\"symb...{\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"runs\",\"symbol_type\":\"property\"},{\"name\":\"with_raw_response\",\"symbol_type\":\"property\"},{\"name\":\"with_streaming_response\",\"symbol_type\":\"property\"},{\"name\":\"create\",\"symbol_type\":\"function\"},{...
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "\n", + "# Collect outputs for both runs\n", + "completions_outputs = [item.sample.output[0].content for item in completions_output]\n", + "responses_outputs = [item.sample.output[0].content for item in responses_output]\n", + "\n", + "# Create DataFrame for side-by-side display (truncated to 250 chars for readability)\n", + "df = pd.DataFrame({\n", + " \"Completions Output\": [c[:250].replace('\\n', ' ') + ('...' if len(c) > 250 else '') for c in completions_outputs],\n", + " \"Responses Output\": [r[:250].replace('\\n', ' ') + ('...' if len(r) > 250 else '') for r in responses_outputs]\n", + "})\n", + "\n", + "# Custom color scheme\n", + "custom_styles = [\n", + " {'selector': 'th', 'props': [('font-size', '1.1em'), ('background-color', '#323C50'), ('color', '#FFFFFF'), ('border-bottom', '2px solid #1CA7EC')]},\n", + " {'selector': 'td', 'props': [('font-size', '1em'), ('max-width', '650px'), ('background-color', '#F6F8FA'), ('color', '#222'), ('border-bottom', '1px solid #DDD')]},\n", + " {'selector': 'tr:hover td', 'props': [('background-color', '#D1ECF1'), ('color', '#18647E')]},\n", + " {'selector': 'tbody tr:nth-child(even) td', 'props': [('background-color', '#E8F1FB')]},\n", + " {'selector': 'tbody tr:nth-child(odd) td', 'props': [('background-color', '#F6F8FA')]},\n", + " {'selector': 'table', 'props': [('border-collapse', 'collapse'), ('border-radius', '6px'), ('overflow', 'hidden')]},\n", + "]\n", + "\n", + "styled = (\n", + " df.style\n", + " .set_properties(**{'white-space': 'pre-wrap', 'word-break': 'break-word', 'padding': '8px'})\n", + " .set_table_styles(custom_styles)\n", + " .hide(axis=\"index\")\n", + ")\n", + "\n", + "display(HTML(\"\"\"\n", + "

\n", + "Completions vs Responses Output\n", + "

\n", + "\"\"\"))\n", + "display(styled)" + ] + }, + { + "cell_type": "markdown", + "id": "8cbe934f", + "metadata": {}, + "source": [ + "### Visualize the Results\n", + "\n", + "Below are visualizations that represent the evaluation data and code outputs for structured QA evaluation. These images provide insights into the data distribution and the evaluation workflow.\n", + "\n", + "---\n", + "\n", + "**Evaluation Data Overview**\n", + "\n", + "![Evaluation Data Part 1](../../../images/eval_qa_data_1.png)\n", + "\n", + "![Evaluation Data Part 2](../../../images/eval_qa_data_2.png)\n", + "\n", + "---\n", + "\n", + "**Evaluation Code Workflow**\n", + "\n", + "![Evaluation Code Structure](../../../images/eval_qa_code.png)\n", + "\n", + "---\n", + "\n", + "By reviewing these visualizations, you can better understand the structure of the evaluation dataset and the steps involved in evaluating structured outputs for QA tasks.\n" + ] + }, + { + "cell_type": "markdown", + "id": "a0ae89ef", + "metadata": {}, + "source": [ + "### Use Case 2: Multi-lingual Sentiment Extraction\n", + "In a similar way, let us evaluate a multi-lingual sentiment extraction model with structured outputs." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "e5f0b782", + "metadata": {}, + "outputs": [], + "source": [ + "# Sample in-memory dataset for sentiment extraction\n", + "sentiment_dataset = [\n", + " {\n", + " \"text\": \"I love this product!\",\n", + " \"channel\": \"twitter\",\n", + " \"language\": \"en\"\n", + " },\n", + " {\n", + " \"text\": \"This is the worst experience I've ever had.\",\n", + " \"channel\": \"support_ticket\",\n", + " \"language\": \"en\"\n", + " },\n", + " {\n", + " \"text\": \"It's okay – not great but not bad either.\",\n", + " \"channel\": \"app_review\",\n", + " \"language\": \"en\"\n", + " },\n", + " {\n", + " \"text\": \"No estoy seguro de lo que pienso sobre este producto.\",\n", + " \"channel\": \"facebook\",\n", + " \"language\": \"es\"\n", + " },\n", + " {\n", + " \"text\": \"总体来说,我对这款产品很满意。\",\n", + " \"channel\": \"wechat\",\n", + " \"language\": \"zh\"\n", + " },\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "cb6954f4", + "metadata": {}, + "outputs": [], + "source": [ + "# Define output schema\n", + "sentiment_output_schema = {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"sentiment\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"overall label: positive / negative / neutral\"\n", + " },\n", + " \"confidence\": {\n", + " \"type\": \"number\",\n", + " \"description\": \"confidence score 0-1\"\n", + " },\n", + " \"emotions\": {\n", + " \"type\": \"array\",\n", + " \"description\": \"list of dominant emotions (e.g. 
joy, anger)\",\n", + " \"items\": {\"type\": \"string\"}\n", + " }\n", + " },\n", + " \"required\": [\"sentiment\", \"confidence\", \"emotions\"],\n", + " \"additionalProperties\": False\n", + "}\n", + "\n", + "# Grader prompts\n", + "sentiment_grader_system = \"\"\"You are a strict grader for sentiment extraction.\n", + "Given the text and the model's JSON output, score correctness on a 1-5 scale.\"\"\"\n", + "\n", + "sentiment_grader_user = \"\"\"Text: {{item.text}}\n", + "Model output:\n", + "{{sample.output_json}}\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "ac815aec", + "metadata": {}, + "outputs": [], + "source": [ + "# Register an eval for the richer sentiment task\n", + "sentiment_eval = client.evals.create(\n", + " name=\"sentiment_extraction_eval\",\n", + " data_source_config={\n", + " \"type\": \"custom\",\n", + " \"item_schema\": { # matches the new dataset fields\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"text\": {\"type\": \"string\"},\n", + " \"channel\": {\"type\": \"string\"},\n", + " \"language\": {\"type\": \"string\"},\n", + " },\n", + " \"required\": [\"text\"],\n", + " },\n", + " \"include_sample_schema\": True,\n", + " },\n", + " testing_criteria=[\n", + " {\n", + " \"type\": \"score_model\",\n", + " \"name\": \"Sentiment Grader\",\n", + " \"model\": \"o3\",\n", + " \"input\": [\n", + " {\"role\": \"system\", \"content\": sentiment_grader_system},\n", + " {\"role\": \"user\", \"content\": sentiment_grader_user},\n", + " ],\n", + " \"range\": [1, 5],\n", + " \"pass_threshold\": 3.5,\n", + " }\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f4aa9d6", + "metadata": {}, "outputs": [], "source": [ - "print('# Completions Output')\n", - "for item in completions_output:\n", - " print(item)\n", + "# Run the sentiment eval\n", + "sentiment_run = client.evals.runs.create(\n", + " name=\"gpt-4.1-sentiment\",\n", + " eval_id=sentiment_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": [{\"item\": item} for item in sentiment_dataset],\n", + " },\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"system\",\n", + " \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"},\n", + " },\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"user\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"{{item.text}}\",\n", + " },\n", + " },\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 100,\n", + " \"top_p\": 0.9,\n", + " \"text\": {\n", + " \"format\": {\n", + " \"type\": \"json_schema\",\n", + " \"name\": \"sentiment_output\",\n", + " \"schema\": sentiment_output_schema,\n", + " \"strict\": True,\n", + " },\n", + " },\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "17f5f960", + "metadata": {}, + "source": [ + "### Visualize evals data \n", + "![image](../../../images/evals_sentiment.png)" + ] + }, + { + "cell_type": "markdown", + "id": "ab141018", + "metadata": {}, + "source": [ + "### Summary and Next Steps\n", + "\n", + "In this notebook, we have demonstrated how to use the OpenAI Evaluation API to evaluate a model's performance on a structured output task. 
\n", + "\n", + "**Next steps:**\n", + "- We encourage you to try out the API with your own models and datasets.\n", + "- You can also explore the API documentation for more details on how to use the API. \n", "\n", - "print('\\n# Responses Output')\n", - "for item in responses_output:\n", - " print(item)" + "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).\n" ] } ], "metadata": { "kernelspec": { - "display_name": "openai", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -327,7 +849,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.9" + "version": "3.11.8" } }, "nbformat": 4, diff --git a/examples/evaluation/use-cases/web-search-evaluation.ipynb b/examples/evaluation/use-cases/web-search-evaluation.ipynb index 91f9dbb5f3..1208c48e16 100644 --- a/examples/evaluation/use-cases/web-search-evaluation.ipynb +++ b/examples/evaluation/use-cases/web-search-evaluation.ipynb @@ -11,7 +11,39 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset." + "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset.\n", + "\n", + "**Goals:**\n", + "- Show how to set up and run an evaluation for web search quality.\n", + "- Provide a template for evaluating information retrieval capabilities of LLMs.\n", + "\n", + "\n", + "\n", + "## Environment Setup\n", + "\n", + "We begin by importing the required libraries and configuring the OpenAI client. \n", + "This ensures we have access to the OpenAI API and all necessary utilities for evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "# Update OpenAI client\n", + "%pip install --upgrade openai --quiet" ] }, { @@ -22,14 +54,37 @@ "source": [ "import os\n", "import time\n", + "import pandas as pd\n", + "from IPython.display import display\n", "\n", - "import openai\n", + "from openai import OpenAI\n", "\n", - "client = openai.OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"))\n", + "client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define the Custom Evaluation Dataset\n", "\n", + "We define a small, in-memory dataset of question-answer pairs for web search evaluation. \n", + "Each item contains a `query` (the user's search prompt) and an `answer` (the expected ground truth).\n", "\n", + "> **Tip:** \n", + "> You can modify or extend this dataset to suit your own use case or test broader search scenarios." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ "def get_dataset(limit=None):\n", - " return [\n", + " dataset = [\n", " {\n", " \"query\": \"coolest person in the world, the 100m dash at the 2008 olympics was the best sports event of all time\",\n", " \"answer\": \"usain bolt\",\n", @@ -42,9 +97,59 @@ " \"query\": \"most fun place to visit, I am obsessed with the Philbrook Museum of Art\",\n", " \"answer\": \"tulsa, oklahoma\",\n", " },\n", + " {\n", + " \"query\": \"who created the python programming language, beloved by data scientists everywhere\",\n", + " \"answer\": \"guido van rossum\",\n", + " },\n", + " {\n", + " \"query\": \"greatest chess player in history, famous for the 1972 world championship\",\n", + " \"answer\": \"bobby fischer\",\n", + " },\n", + " {\n", + " \"query\": \"the city of lights, home to the eiffel tower and louvre museum\",\n", + " \"answer\": \"paris\",\n", + " },\n", + " {\n", + " \"query\": \"most popular search engine, whose name is now a verb\",\n", + " \"answer\": \"google\",\n", + " },\n", + " {\n", + " \"query\": \"the first man to walk on the moon, giant leap for mankind\",\n", + " \"answer\": \"neil armstrong\",\n", + " },\n", + " {\n", + " \"query\": \"groundbreaking electric car company founded by elon musk\",\n", + " \"answer\": \"tesla\",\n", + " },\n", + " {\n", + " \"query\": \"founder of microsoft, philanthropist and software pioneer\",\n", + " \"answer\": \"bill gates\",\n", + " },\n", " ]\n", + " return dataset[:limit] if limit else dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Grading Logic\n", + "\n", + "To evaluate the model’s answers, we use an LLM-based pass/fail grader:\n", "\n", + "- **Pass/Fail Grader:** \n", + " An LLM-based grader that checks if the model’s answer (from web search) matches the expected answer (ground truth) or contains the correct information.\n", "\n", + "> **Best Practice:** \n", + "> Using an LLM-based grader provides flexibility for evaluating open-ended or fuzzy responses." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ "pass_fail_grader = \"\"\"\n", "You are a helpful assistant that grades the quality of a web search.\n", "You will be given a query and an answer.\n", @@ -66,10 +171,36 @@ "\n", "{{item.answer}}\n", "\n", - "\"\"\"\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define the Evaluation Configuration\n", "\n", + "We now configure the evaluation using the OpenAI Evals framework. \n", + "\n", + "This step specifies:\n", + "- The evaluation name and dataset.\n", + "- The schema for each item (what fields are present in each Q&A pair).\n", + "- The grader(s) to use (LLM-based pass/fail).\n", + "- The passing criteria and labels.\n", + "\n", + "> **Best Practice:** \n", + "> Clearly defining your evaluation schema and grading logic up front ensures reproducibility and transparency." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# Create the evaluation definition using the OpenAI Evals client.\n", "logs_eval = client.evals.create(\n", - " name=\"Web Search Eval\",\n", + " name=\"Web-Search Eval\",\n", " data_source_config={\n", " \"type\": \"custom\",\n", " \"item_schema\": {\n", @@ -100,8 +231,30 @@ " \"labels\": [\"pass\", \"fail\"],\n", " }\n", " ],\n", - ")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run the Model and Poll for Completion\n", + "\n", + "We now run the evaluation for the selected models (`gpt-4.1` and `gpt-4.1-mini`). \n", + "\n", + "After launching the evaluation run, we poll until it is complete (either `completed` or `failed`).\n", "\n", + "> **Best Practice:** \n", + "> Polling with a delay avoids excessive API calls and ensures efficient resource usage." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# Launch the evaluation run for gpt-4.1 using web search\n", "gpt_4one_responses_run = client.evals.runs.create(\n", " name=\"gpt-4.1\",\n", " eval_id=logs_eval.id,\n", @@ -141,41 +294,272 @@ " \"tools\": [{\"type\": \"web_search_preview\"}],\n", " },\n", " },\n", - ")\n", - "\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# Launch the evaluation run for gpt-4.1-mini using web search\n", + "gpt_4one_mini_responses_run = client.evals.runs.create(\n", + " name=\"gpt-4.1-mini\",\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": [{\"item\": item} for item in get_dataset()],\n", + " },\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"system\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"You are a helpful assistant that searches the web and gives contextually relevant answers.\",\n", + " },\n", + " },\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"user\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"Search the web for the answer to the query {{item.query}}\",\n", + " },\n", + " },\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1-mini\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 10000,\n", + " \"top_p\": 0.9,\n", + " \"tools\": [{\"type\": \"web_search_preview\"}],\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "evalrun_68477e0f56a481919eea5e7d8a04225e completed ResultCounts(errored=0, failed=1, passed=9, total=10)\n", + "evalrun_68477e712bb48191bc7368b084f8c52c completed ResultCounts(errored=0, failed=0, passed=10, total=10)\n" + ] + } + ], + "source": [ + "# poll both runs at the same time, until they are complete or failed\n", "def poll_runs(eval_id, run_ids):\n", - " # poll both runs at the same time, until they are complete or failed\n", " while True:\n", " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", " for run in runs:\n", " print(run.id, run.status, run.result_counts)\n", - " if all(run.status == \"completed\" or run.status == \"failed\" for run in 
runs):\n", + " if all(run.status in {\"completed\", \"failed\"} for run in runs):\n", " break\n", " time.sleep(5)\n", "\n", + "# Start polling the run until completion\n", + "poll_runs(logs_eval.id, [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Display and Interpret Model Outputs\n", "\n", - "poll_runs(logs_eval.id, [gpt_4one_responses_run.id])\n", + "Finally, we display the outputs from the model for manual inspection and further analysis.\n", "\n", + "- Each answer is printed for each query in the dataset.\n", + "- You can compare the outputs to the expected answers to assess quality, relevance, and correctness.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
GPT-4.1 OutputGPT-4.1-mini Output
0If you're captivated by the Philbrook Museum o...Bobby Fischer is widely regarded as one of the...
1\\n## [Paris, France](https://www.google.com/ma...The 2008 Olympic 100m dash is widely regarded ...
2Bill Gates, born on October 28, 1955, in Seatt...If you're looking for fun places to visit in T...
3Usain Bolt's performance in the 100-meter fina...On July 20, 1969, astronaut Neil Armstrong bec...
4It seems you're interested in both the world's...Bill Gates is a renowned software pioneer, phi...
5Neil Armstrong was the first person to walk on...Your statement, \"there is nothing better than ...
6Tesla, Inc. is an American electric vehicle an...The search engine whose name has become synony...
7Bobby Fischer, widely regarded as one of the g...\\n## [Paris, France](https://www.google.com/ma...
8Guido van Rossum, a Dutch programmer born on J...Guido van Rossum, a Dutch programmer born on J...
9The most popular search engine whose name has ...Elon Musk is the CEO and largest shareholder o...
\n", + "
" + ], + "text/plain": [ + " GPT-4.1 Output \\\n", + "0 If you're captivated by the Philbrook Museum o... \n", + "1 \\n## [Paris, France](https://www.google.com/ma... \n", + "2 Bill Gates, born on October 28, 1955, in Seatt... \n", + "3 Usain Bolt's performance in the 100-meter fina... \n", + "4 It seems you're interested in both the world's... \n", + "5 Neil Armstrong was the first person to walk on... \n", + "6 Tesla, Inc. is an American electric vehicle an... \n", + "7 Bobby Fischer, widely regarded as one of the g... \n", + "8 Guido van Rossum, a Dutch programmer born on J... \n", + "9 The most popular search engine whose name has ... \n", + "\n", + " GPT-4.1-mini Output \n", + "0 Bobby Fischer is widely regarded as one of the... \n", + "1 The 2008 Olympic 100m dash is widely regarded ... \n", + "2 If you're looking for fun places to visit in T... \n", + "3 On July 20, 1969, astronaut Neil Armstrong bec... \n", + "4 Bill Gates is a renowned software pioneer, phi... \n", + "5 Your statement, \"there is nothing better than ... \n", + "6 The search engine whose name has become synony... \n", + "7 \\n## [Paris, France](https://www.google.com/ma... \n", + "8 Guido van Rossum, a Dutch programmer born on J... \n", + "9 Elon Musk is the CEO and largest shareholder o... " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Retrieve output items for the 4.1 model after completion\n", "four_one = client.evals.runs.output_items.list(\n", " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", - ")" + ")\n", + "\n", + "# Retrieve output items for the 4.1-mini model after completion\n", + "four_one_mini = client.evals.runs.output_items.list(\n", + " run_id=gpt_4one_mini_responses_run.id, eval_id=logs_eval.id\n", + ")\n", + "\n", + "# Collect outputs for both models\n", + "four_one_outputs = [item.sample.output[0].content for item in four_one]\n", + "four_one_mini_outputs = [item.sample.output[0].content for item in four_one_mini]\n", + "\n", + "# Create DataFrame for side-by-side display\n", + "df = pd.DataFrame({\n", + " \"GPT-4.1 Output\": four_one_outputs,\n", + " \"GPT-4.1-mini Output\": four_one_mini_outputs\n", + "})\n", + "\n", + "display(df)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "for item in four_one:\n", - " print(item.sample.output[0].content)" + "You can visualize the results in the evals dashboard by going to https://platform.openai.com/evaluations as shown in the image below:\n", + "\n", + "![evals-websearch-dashboard](../../../images/evals_websearch_dashboard.png)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we demonstrated a workflow for evaluating the web search capabilities of language models using the OpenAI Evals framework.\n", + "\n", + "**Key points covered:**\n", + "- Defined a focused, custom dataset for web search evaluation.\n", + "- Configured an LLM-based grader for robust assessment.\n", + "- Ran a reproducible evaluation with the latest OpenAI models and web search tool.\n", + "- Retrieved and displayed model outputs for inspection.\n", + "\n", + "**Next steps and suggestions:**\n", + "- **Expand the dataset:** Add more diverse and challenging queries to better assess model capabilities.\n", + "- **Analyze results:** Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses.\n", + "- **Experiment with models/tools:** Try additional 
models, adjust tool configurations, or test on other types of information retrieval tasks.\n", + "- **Automate reporting:** Generate summary tables or plots for easier sharing and decision-making.\n", + "\n", + "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals)." ] } ], "metadata": { "kernelspec": { - "display_name": "openai", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -189,7 +573,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.9" + "version": "3.11.8" } }, "nbformat": 4, diff --git a/images/eval_qa_code.png b/images/eval_qa_code.png new file mode 100644 index 0000000000..623de97022 Binary files /dev/null and b/images/eval_qa_code.png differ diff --git a/images/eval_qa_data_1.png b/images/eval_qa_data_1.png new file mode 100644 index 0000000000..d9a8cec284 Binary files /dev/null and b/images/eval_qa_data_1.png differ diff --git a/images/eval_qa_data_2.png b/images/eval_qa_data_2.png new file mode 100644 index 0000000000..6e2895cb14 Binary files /dev/null and b/images/eval_qa_data_2.png differ diff --git a/images/evals_sentiment.png b/images/evals_sentiment.png new file mode 100644 index 0000000000..5ebe8f3072 Binary files /dev/null and b/images/evals_sentiment.png differ diff --git a/images/evals_websearch_dashboard.png b/images/evals_websearch_dashboard.png new file mode 100644 index 0000000000..ae34fc4c6a Binary files /dev/null and b/images/evals_websearch_dashboard.png differ diff --git a/registry.yaml b/registry.yaml index ac98ad8cc7..552b5fef48 100644 --- a/registry.yaml +++ b/registry.yaml @@ -2147,6 +2147,7 @@ date: 2025-06-09 authors: - josiah-openai + - shikhar-cyber tags: - evals-api - responses @@ -2167,6 +2168,7 @@ date: 2025-06-09 authors: - josiah-openai + - shikhar-cyber tags: - evals-api - responses