diff --git a/examples/evaluation/use-cases/web-search-evaluation.ipynb b/examples/evaluation/use-cases/web-search-evaluation.ipynb
index 91f9dbb5f3..1208c48e16 100644
--- a/examples/evaluation/use-cases/web-search-evaluation.ipynb
+++ b/examples/evaluation/use-cases/web-search-evaluation.ipynb
@@ -11,7 +11,39 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset."
+ "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset.\n",
+ "\n",
+ "**Goals:**\n",
+ "- Show how to set up and run an evaluation for web search quality.\n",
+ "- Provide a template for evaluating information retrieval capabilities of LLMs.\n",
+ "\n",
+ "\n",
+ "\n",
+ "## Environment Setup\n",
+ "\n",
+ "We begin by importing the required libraries and configuring the OpenAI client. \n",
+ "This ensures we have access to the OpenAI API and all necessary utilities for evaluation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Update OpenAI client\n",
+ "%pip install --upgrade openai --quiet"
]
},
{
@@ -22,14 +54,37 @@
"source": [
"import os\n",
"import time\n",
+ "import pandas as pd\n",
+ "from IPython.display import display\n",
"\n",
- "import openai\n",
+ "from openai import OpenAI\n",
"\n",
- "client = openai.OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"))\n",
+ "client = OpenAI(\n",
+ " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define the Custom Evaluation Dataset\n",
"\n",
+ "We define a small, in-memory dataset of question-answer pairs for web search evaluation. \n",
+ "Each item contains a `query` (the user's search prompt) and an `answer` (the expected ground truth).\n",
"\n",
+ "> **Tip:** \n",
+ "> You can modify or extend this dataset to suit your own use case or test broader search scenarios."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
"def get_dataset(limit=None):\n",
- " return [\n",
+ " dataset = [\n",
" {\n",
" \"query\": \"coolest person in the world, the 100m dash at the 2008 olympics was the best sports event of all time\",\n",
" \"answer\": \"usain bolt\",\n",
@@ -42,9 +97,59 @@
" \"query\": \"most fun place to visit, I am obsessed with the Philbrook Museum of Art\",\n",
" \"answer\": \"tulsa, oklahoma\",\n",
" },\n",
+ " {\n",
+ " \"query\": \"who created the python programming language, beloved by data scientists everywhere\",\n",
+ " \"answer\": \"guido van rossum\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"greatest chess player in history, famous for the 1972 world championship\",\n",
+ " \"answer\": \"bobby fischer\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"the city of lights, home to the eiffel tower and louvre museum\",\n",
+ " \"answer\": \"paris\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"most popular search engine, whose name is now a verb\",\n",
+ " \"answer\": \"google\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"the first man to walk on the moon, giant leap for mankind\",\n",
+ " \"answer\": \"neil armstrong\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"groundbreaking electric car company founded by elon musk\",\n",
+ " \"answer\": \"tesla\",\n",
+ " },\n",
+ " {\n",
+ " \"query\": \"founder of microsoft, philanthropist and software pioneer\",\n",
+ " \"answer\": \"bill gates\",\n",
+ " },\n",
" ]\n",
+ " return dataset[:limit] if limit else dataset"
+ ]
+ },
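+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check, we can preview the dataset as a table before wiring it into the eval. This minimal sketch relies only on `get_dataset` (defined above) and the `pandas` import from the setup cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Preview the query/answer pairs as a quick sanity check\n",
+    "display(pd.DataFrame(get_dataset()))"
+   ]
+  },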
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define Grading Logic\n",
+ "\n",
+ "To evaluate the model’s answers, we use an LLM-based pass/fail grader:\n",
"\n",
+ "- **Pass/Fail Grader:** \n",
+ " An LLM-based grader that checks if the model’s answer (from web search) matches the expected answer (ground truth) or contains the correct information.\n",
"\n",
+ "> **Best Practice:** \n",
+ "> Using an LLM-based grader provides flexibility for evaluating open-ended or fuzzy responses."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
"pass_fail_grader = \"\"\"\n",
"You are a helpful assistant that grades the quality of a web search.\n",
"You will be given a query and an answer.\n",
@@ -66,10 +171,36 @@
"\n",
"{{item.answer}}\n",
"\n",
- "\"\"\"\n",
+ "\"\"\""
+ ]
+ },
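+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `{{item.query}}` and `{{item.answer}}` placeholders are filled in by the Evals service for each dataset item at grading time. As a rough local sketch (not the service's actual template engine), we can substitute the `item` fields ourselves to see approximately what the grader receives:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Local preview of the grader prompt for the first dataset item.\n",
+    "# Illustrative only: the Evals service performs the real substitution\n",
+    "# server-side, including the sampled model output.\n",
+    "sample_item = get_dataset(limit=1)[0]\n",
+    "preview = (\n",
+    "    pass_fail_grader.replace(\"{{item.query}}\", sample_item[\"query\"])\n",
+    "    .replace(\"{{item.answer}}\", sample_item[\"answer\"])\n",
+    ")\n",
+    "print(preview)"
+   ]
+  },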
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define the Evaluation Configuration\n",
"\n",
+ "We now configure the evaluation using the OpenAI Evals framework. \n",
+ "\n",
+ "This step specifies:\n",
+ "- The evaluation name and dataset.\n",
+ "- The schema for each item (what fields are present in each Q&A pair).\n",
+ "- The grader(s) to use (LLM-based pass/fail).\n",
+ "- The passing criteria and labels.\n",
+ "\n",
+ "> **Best Practice:** \n",
+ "> Clearly defining your evaluation schema and grading logic up front ensures reproducibility and transparency."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create the evaluation definition using the OpenAI Evals client.\n",
"logs_eval = client.evals.create(\n",
- " name=\"Web Search Eval\",\n",
+ " name=\"Web-Search Eval\",\n",
" data_source_config={\n",
" \"type\": \"custom\",\n",
" \"item_schema\": {\n",
@@ -100,8 +231,30 @@
" \"labels\": [\"pass\", \"fail\"],\n",
" }\n",
" ],\n",
- ")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Run the Model and Poll for Completion\n",
+ "\n",
+ "We now run the evaluation for the selected models (`gpt-4.1` and `gpt-4.1-mini`). \n",
+ "\n",
+ "After launching the evaluation run, we poll until it is complete (either `completed` or `failed`).\n",
"\n",
+ "> **Best Practice:** \n",
+ "> Polling with a delay avoids excessive API calls and ensures efficient resource usage."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Launch the evaluation run for gpt-4.1 using web search\n",
"gpt_4one_responses_run = client.evals.runs.create(\n",
" name=\"gpt-4.1\",\n",
" eval_id=logs_eval.id,\n",
@@ -141,41 +294,272 @@
" \"tools\": [{\"type\": \"web_search_preview\"}],\n",
" },\n",
" },\n",
- ")\n",
- "\n",
- "\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Launch the evaluation run for gpt-4.1-mini using web search\n",
+ "gpt_4one_mini_responses_run = client.evals.runs.create(\n",
+ " name=\"gpt-4.1-mini\",\n",
+ " eval_id=logs_eval.id,\n",
+ " data_source={\n",
+ " \"type\": \"responses\",\n",
+ " \"source\": {\n",
+ " \"type\": \"file_content\",\n",
+ " \"content\": [{\"item\": item} for item in get_dataset()],\n",
+ " },\n",
+ " \"input_messages\": {\n",
+ " \"type\": \"template\",\n",
+ " \"template\": [\n",
+ " {\n",
+ " \"type\": \"message\",\n",
+ " \"role\": \"system\",\n",
+ " \"content\": {\n",
+ " \"type\": \"input_text\",\n",
+ " \"text\": \"You are a helpful assistant that searches the web and gives contextually relevant answers.\",\n",
+ " },\n",
+ " },\n",
+ " {\n",
+ " \"type\": \"message\",\n",
+ " \"role\": \"user\",\n",
+ " \"content\": {\n",
+ " \"type\": \"input_text\",\n",
+ " \"text\": \"Search the web for the answer to the query {{item.query}}\",\n",
+ " },\n",
+ " },\n",
+ " ],\n",
+ " },\n",
+ " \"model\": \"gpt-4.1-mini\",\n",
+ " \"sampling_params\": {\n",
+ " \"seed\": 42,\n",
+ " \"temperature\": 0.7,\n",
+ " \"max_completions_tokens\": 10000,\n",
+ " \"top_p\": 0.9,\n",
+ " \"tools\": [{\"type\": \"web_search_preview\"}],\n",
+ " },\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "evalrun_68477e0f56a481919eea5e7d8a04225e completed ResultCounts(errored=0, failed=1, passed=9, total=10)\n",
+ "evalrun_68477e712bb48191bc7368b084f8c52c completed ResultCounts(errored=0, failed=0, passed=10, total=10)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# poll both runs at the same time, until they are complete or failed\n",
"def poll_runs(eval_id, run_ids):\n",
- " # poll both runs at the same time, until they are complete or failed\n",
" while True:\n",
" runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n",
" for run in runs:\n",
" print(run.id, run.status, run.result_counts)\n",
- " if all(run.status == \"completed\" or run.status == \"failed\" for run in runs):\n",
+ " if all(run.status in {\"completed\", \"failed\"} for run in runs):\n",
" break\n",
" time.sleep(5)\n",
"\n",
+ "# Start polling the run until completion\n",
+ "poll_runs(logs_eval.id, [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id])\n"
+ ]
+ },
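+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once both runs finish, we can summarize pass rates from the `result_counts` each run reports. The sketch below simply re-fetches the two runs created above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Summarize pass rates for each completed run (minimal sketch)\n",
+    "for run_id in [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id]:\n",
+    "    run = client.evals.runs.retrieve(run_id, eval_id=logs_eval.id)\n",
+    "    counts = run.result_counts\n",
+    "    pass_rate = counts.passed / counts.total if counts.total else 0.0\n",
+    "    print(f\"{run.name}: {counts.passed}/{counts.total} passed ({pass_rate:.0%})\")"
+   ]
+  },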
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Display and Interpret Model Outputs\n",
"\n",
- "poll_runs(logs_eval.id, [gpt_4one_responses_run.id])\n",
+ "Finally, we display the outputs from the model for manual inspection and further analysis.\n",
"\n",
+ "- Each answer is printed for each query in the dataset.\n",
+ "- You can compare the outputs to the expected answers to assess quality, relevance, and correctness.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " GPT-4.1 Output | \n",
+ " GPT-4.1-mini Output | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " If you're captivated by the Philbrook Museum o... | \n",
+ " Bobby Fischer is widely regarded as one of the... | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " \\n## [Paris, France](https://www.google.com/ma... | \n",
+ " The 2008 Olympic 100m dash is widely regarded ... | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " Bill Gates, born on October 28, 1955, in Seatt... | \n",
+ " If you're looking for fun places to visit in T... | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " Usain Bolt's performance in the 100-meter fina... | \n",
+ " On July 20, 1969, astronaut Neil Armstrong bec... | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " It seems you're interested in both the world's... | \n",
+ " Bill Gates is a renowned software pioneer, phi... | \n",
+ "
\n",
+ " \n",
+ " 5 | \n",
+ " Neil Armstrong was the first person to walk on... | \n",
+ " Your statement, \"there is nothing better than ... | \n",
+ "
\n",
+ " \n",
+ " 6 | \n",
+ " Tesla, Inc. is an American electric vehicle an... | \n",
+ " The search engine whose name has become synony... | \n",
+ "
\n",
+ " \n",
+ " 7 | \n",
+ " Bobby Fischer, widely regarded as one of the g... | \n",
+ " \\n## [Paris, France](https://www.google.com/ma... | \n",
+ "
\n",
+ " \n",
+ " 8 | \n",
+ " Guido van Rossum, a Dutch programmer born on J... | \n",
+ " Guido van Rossum, a Dutch programmer born on J... | \n",
+ "
\n",
+ " \n",
+ " 9 | \n",
+ " The most popular search engine whose name has ... | \n",
+ " Elon Musk is the CEO and largest shareholder o... | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " GPT-4.1 Output \\\n",
+ "0 If you're captivated by the Philbrook Museum o... \n",
+ "1 \\n## [Paris, France](https://www.google.com/ma... \n",
+ "2 Bill Gates, born on October 28, 1955, in Seatt... \n",
+ "3 Usain Bolt's performance in the 100-meter fina... \n",
+ "4 It seems you're interested in both the world's... \n",
+ "5 Neil Armstrong was the first person to walk on... \n",
+ "6 Tesla, Inc. is an American electric vehicle an... \n",
+ "7 Bobby Fischer, widely regarded as one of the g... \n",
+ "8 Guido van Rossum, a Dutch programmer born on J... \n",
+ "9 The most popular search engine whose name has ... \n",
+ "\n",
+ " GPT-4.1-mini Output \n",
+ "0 Bobby Fischer is widely regarded as one of the... \n",
+ "1 The 2008 Olympic 100m dash is widely regarded ... \n",
+ "2 If you're looking for fun places to visit in T... \n",
+ "3 On July 20, 1969, astronaut Neil Armstrong bec... \n",
+ "4 Bill Gates is a renowned software pioneer, phi... \n",
+ "5 Your statement, \"there is nothing better than ... \n",
+ "6 The search engine whose name has become synony... \n",
+ "7 \\n## [Paris, France](https://www.google.com/ma... \n",
+ "8 Guido van Rossum, a Dutch programmer born on J... \n",
+ "9 Elon Musk is the CEO and largest shareholder o... "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Retrieve output items for the 4.1 model after completion\n",
"four_one = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
- ")"
+ ")\n",
+ "\n",
+ "# Retrieve output items for the 4.1-mini model after completion\n",
+ "four_one_mini = client.evals.runs.output_items.list(\n",
+ " run_id=gpt_4one_mini_responses_run.id, eval_id=logs_eval.id\n",
+ ")\n",
+ "\n",
+ "# Collect outputs for both models\n",
+ "four_one_outputs = [item.sample.output[0].content for item in four_one]\n",
+ "four_one_mini_outputs = [item.sample.output[0].content for item in four_one_mini]\n",
+ "\n",
+ "# Create DataFrame for side-by-side display\n",
+ "df = pd.DataFrame({\n",
+ " \"GPT-4.1 Output\": four_one_outputs,\n",
+ " \"GPT-4.1-mini Output\": four_one_mini_outputs\n",
+ "})\n",
+ "\n",
+ "display(df)"
]
},
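+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want to share the side-by-side outputs outside the notebook, you can export the DataFrame; the filename below is arbitrary."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Export the comparison table for sharing (hypothetical filename)\n",
+    "df.to_csv(\"web_search_eval_outputs.csv\", index=False)"
+   ]
+  },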
{
- "cell_type": "code",
- "execution_count": null,
+ "cell_type": "markdown",
"metadata": {},
- "outputs": [],
"source": [
- "for item in four_one:\n",
- " print(item.sample.output[0].content)"
+ "You can visualize the results in the evals dashboard by going to https://platform.openai.com/evaluations as shown in the image below:\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this notebook, we demonstrated a workflow for evaluating the web search capabilities of language models using the OpenAI Evals framework.\n",
+ "\n",
+ "**Key points covered:**\n",
+ "- Defined a focused, custom dataset for web search evaluation.\n",
+ "- Configured an LLM-based grader for robust assessment.\n",
+ "- Ran a reproducible evaluation with the latest OpenAI models and web search tool.\n",
+ "- Retrieved and displayed model outputs for inspection.\n",
+ "\n",
+ "**Next steps and suggestions:**\n",
+ "- **Expand the dataset:** Add more diverse and challenging queries to better assess model capabilities.\n",
+ "- **Analyze results:** Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses.\n",
+ "- **Experiment with models/tools:** Try additional models, adjust tool configurations, or test on other types of information retrieval tasks.\n",
+ "- **Automate reporting:** Generate summary tables or plots for easier sharing and decision-making.\n",
+ "\n",
+ "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals)."
]
}
],
"metadata": {
"kernelspec": {
- "display_name": "openai",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -189,7 +573,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.12.9"
+ "version": "3.11.8"
}
},
"nbformat": 4,
diff --git a/images/evals_websearch_dashboard.png b/images/evals_websearch_dashboard.png
new file mode 100644
index 0000000000..ae34fc4c6a
Binary files /dev/null and b/images/evals_websearch_dashboard.png differ
diff --git a/registry.yaml b/registry.yaml
index ac98ad8cc7..26bcd7dc0b 100644
--- a/registry.yaml
+++ b/registry.yaml
@@ -2167,6 +2167,7 @@
date: 2025-06-09
authors:
- josiah-openai
+ - shikhar-cyber
tags:
- evals-api
- responses