diff --git a/examples/evaluation/use-cases/web-search-evaluation.ipynb b/examples/evaluation/use-cases/web-search-evaluation.ipynb index 91f9dbb5f3..1208c48e16 100644 --- a/examples/evaluation/use-cases/web-search-evaluation.ipynb +++ b/examples/evaluation/use-cases/web-search-evaluation.ipynb @@ -11,7 +11,39 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset." + "This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset.\n", + "\n", + "**Goals:**\n", + "- Show how to set up and run an evaluation for web search quality.\n", + "- Provide a template for evaluating information retrieval capabilities of LLMs.\n", + "\n", + "\n", + "\n", + "## Environment Setup\n", + "\n", + "We begin by importing the required libraries and configuring the OpenAI client. \n", + "This ensures we have access to the OpenAI API and all necessary utilities for evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "# Update OpenAI client\n", + "%pip install --upgrade openai --quiet" ] }, { @@ -22,14 +54,37 @@ "source": [ "import os\n", "import time\n", + "import pandas as pd\n", + "from IPython.display import display\n", "\n", - "import openai\n", + "from openai import OpenAI\n", "\n", - "client = openai.OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"))\n", + "client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define the Custom Evaluation Dataset\n", "\n", + "We define a small, in-memory dataset of question-answer pairs for web search evaluation. \n", + "Each item contains a `query` (the user's search prompt) and an `answer` (the expected ground truth).\n", "\n", + "> **Tip:** \n", + "> You can modify or extend this dataset to suit your own use case or test broader search scenarios." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ "def get_dataset(limit=None):\n", - " return [\n", + " dataset = [\n", " {\n", " \"query\": \"coolest person in the world, the 100m dash at the 2008 olympics was the best sports event of all time\",\n", " \"answer\": \"usain bolt\",\n", @@ -42,9 +97,59 @@ " \"query\": \"most fun place to visit, I am obsessed with the Philbrook Museum of Art\",\n", " \"answer\": \"tulsa, oklahoma\",\n", " },\n", + " {\n", + " \"query\": \"who created the python programming language, beloved by data scientists everywhere\",\n", + " \"answer\": \"guido van rossum\",\n", + " },\n", + " {\n", + " \"query\": \"greatest chess player in history, famous for the 1972 world championship\",\n", + " \"answer\": \"bobby fischer\",\n", + " },\n", + " {\n", + " \"query\": \"the city of lights, home to the eiffel tower and louvre museum\",\n", + " \"answer\": \"paris\",\n", + " },\n", + " {\n", + " \"query\": \"most popular search engine, whose name is now a verb\",\n", + " \"answer\": \"google\",\n", + " },\n", + " {\n", + " \"query\": \"the first man to walk on the moon, giant leap for mankind\",\n", + " \"answer\": \"neil armstrong\",\n", + " },\n", + " {\n", + " \"query\": \"groundbreaking electric car company founded by elon musk\",\n", + " \"answer\": \"tesla\",\n", + " },\n", + " {\n", + " \"query\": \"founder of microsoft, philanthropist and software pioneer\",\n", + " \"answer\": \"bill gates\",\n", + " },\n", " ]\n", + " return dataset[:limit] if limit else dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Grading Logic\n", + "\n", + "To evaluate the model’s answers, we use an LLM-based pass/fail grader:\n", "\n", + "- **Pass/Fail Grader:** \n", + " An LLM-based grader that checks if the model’s answer (from web search) matches the expected answer (ground truth) or contains the correct information.\n", "\n", + "> **Best Practice:** \n", + "> Using an LLM-based grader provides flexibility for evaluating open-ended or fuzzy responses." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ "pass_fail_grader = \"\"\"\n", "You are a helpful assistant that grades the quality of a web search.\n", "You will be given a query and an answer.\n", @@ -66,10 +171,36 @@ "\n", "{{item.answer}}\n", "\n", - "\"\"\"\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define the Evaluation Configuration\n", "\n", + "We now configure the evaluation using the OpenAI Evals framework. \n", + "\n", + "This step specifies:\n", + "- The evaluation name and dataset.\n", + "- The schema for each item (what fields are present in each Q&A pair).\n", + "- The grader(s) to use (LLM-based pass/fail).\n", + "- The passing criteria and labels.\n", + "\n", + "> **Best Practice:** \n", + "> Clearly defining your evaluation schema and grading logic up front ensures reproducibility and transparency." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# Create the evaluation definition using the OpenAI Evals client.\n", "logs_eval = client.evals.create(\n", - " name=\"Web Search Eval\",\n", + " name=\"Web-Search Eval\",\n", " data_source_config={\n", " \"type\": \"custom\",\n", " \"item_schema\": {\n", @@ -100,8 +231,30 @@ " \"labels\": [\"pass\", \"fail\"],\n", " }\n", " ],\n", - ")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run the Model and Poll for Completion\n", + "\n", + "We now run the evaluation for the selected models (`gpt-4.1` and `gpt-4.1-mini`). \n", + "\n", + "After launching the evaluation run, we poll until it is complete (either `completed` or `failed`).\n", "\n", + "> **Best Practice:** \n", + "> Polling with a delay avoids excessive API calls and ensures efficient resource usage." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# Launch the evaluation run for gpt-4.1 using web search\n", "gpt_4one_responses_run = client.evals.runs.create(\n", " name=\"gpt-4.1\",\n", " eval_id=logs_eval.id,\n", @@ -141,41 +294,272 @@ " \"tools\": [{\"type\": \"web_search_preview\"}],\n", " },\n", " },\n", - ")\n", - "\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# Launch the evaluation run for gpt-4.1-mini using web search\n", + "gpt_4one_mini_responses_run = client.evals.runs.create(\n", + " name=\"gpt-4.1-mini\",\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\n", + " \"type\": \"file_content\",\n", + " \"content\": [{\"item\": item} for item in get_dataset()],\n", + " },\n", + " \"input_messages\": {\n", + " \"type\": \"template\",\n", + " \"template\": [\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"system\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"You are a helpful assistant that searches the web and gives contextually relevant answers.\",\n", + " },\n", + " },\n", + " {\n", + " \"type\": \"message\",\n", + " \"role\": \"user\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"Search the web for the answer to the query {{item.query}}\",\n", + " },\n", + " },\n", + " ],\n", + " },\n", + " \"model\": \"gpt-4.1-mini\",\n", + " \"sampling_params\": {\n", + " \"seed\": 42,\n", + " \"temperature\": 0.7,\n", + " \"max_completions_tokens\": 10000,\n", + " \"top_p\": 0.9,\n", + " \"tools\": [{\"type\": \"web_search_preview\"}],\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "evalrun_68477e0f56a481919eea5e7d8a04225e completed ResultCounts(errored=0, failed=1, passed=9, total=10)\n", + "evalrun_68477e712bb48191bc7368b084f8c52c completed ResultCounts(errored=0, failed=0, passed=10, total=10)\n" + ] + } + ], + "source": [ + "# poll both runs at the same time, until they are complete or failed\n", "def poll_runs(eval_id, run_ids):\n", - " # poll both runs at the same time, until they are complete or failed\n", " while True:\n", " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", " for run in runs:\n", " print(run.id, run.status, run.result_counts)\n", - " if all(run.status == \"completed\" or run.status == \"failed\" for run in 
runs):\n", + " if all(run.status in {\"completed\", \"failed\"} for run in runs):\n", " break\n", " time.sleep(5)\n", "\n", + "# Start polling the run until completion\n", + "poll_runs(logs_eval.id, [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Display and Interpret Model Outputs\n", "\n", - "poll_runs(logs_eval.id, [gpt_4one_responses_run.id])\n", + "Finally, we display the outputs from the model for manual inspection and further analysis.\n", "\n", + "- Each answer is printed for each query in the dataset.\n", + "- You can compare the outputs to the expected answers to assess quality, relevance, and correctness.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
GPT-4.1 OutputGPT-4.1-mini Output
0If you're captivated by the Philbrook Museum o...Bobby Fischer is widely regarded as one of the...
1\\n## [Paris, France](https://www.google.com/ma...The 2008 Olympic 100m dash is widely regarded ...
2Bill Gates, born on October 28, 1955, in Seatt...If you're looking for fun places to visit in T...
3Usain Bolt's performance in the 100-meter fina...On July 20, 1969, astronaut Neil Armstrong bec...
4It seems you're interested in both the world's...Bill Gates is a renowned software pioneer, phi...
5Neil Armstrong was the first person to walk on...Your statement, \"there is nothing better than ...
6Tesla, Inc. is an American electric vehicle an...The search engine whose name has become synony...
7Bobby Fischer, widely regarded as one of the g...\\n## [Paris, France](https://www.google.com/ma...
8Guido van Rossum, a Dutch programmer born on J...Guido van Rossum, a Dutch programmer born on J...
9The most popular search engine whose name has ...Elon Musk is the CEO and largest shareholder o...
\n", + "
" + ], + "text/plain": [ + " GPT-4.1 Output \\\n", + "0 If you're captivated by the Philbrook Museum o... \n", + "1 \\n## [Paris, France](https://www.google.com/ma... \n", + "2 Bill Gates, born on October 28, 1955, in Seatt... \n", + "3 Usain Bolt's performance in the 100-meter fina... \n", + "4 It seems you're interested in both the world's... \n", + "5 Neil Armstrong was the first person to walk on... \n", + "6 Tesla, Inc. is an American electric vehicle an... \n", + "7 Bobby Fischer, widely regarded as one of the g... \n", + "8 Guido van Rossum, a Dutch programmer born on J... \n", + "9 The most popular search engine whose name has ... \n", + "\n", + " GPT-4.1-mini Output \n", + "0 Bobby Fischer is widely regarded as one of the... \n", + "1 The 2008 Olympic 100m dash is widely regarded ... \n", + "2 If you're looking for fun places to visit in T... \n", + "3 On July 20, 1969, astronaut Neil Armstrong bec... \n", + "4 Bill Gates is a renowned software pioneer, phi... \n", + "5 Your statement, \"there is nothing better than ... \n", + "6 The search engine whose name has become synony... \n", + "7 \\n## [Paris, France](https://www.google.com/ma... \n", + "8 Guido van Rossum, a Dutch programmer born on J... \n", + "9 Elon Musk is the CEO and largest shareholder o... " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Retrieve output items for the 4.1 model after completion\n", "four_one = client.evals.runs.output_items.list(\n", " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", - ")" + ")\n", + "\n", + "# Retrieve output items for the 4.1-mini model after completion\n", + "four_one_mini = client.evals.runs.output_items.list(\n", + " run_id=gpt_4one_mini_responses_run.id, eval_id=logs_eval.id\n", + ")\n", + "\n", + "# Collect outputs for both models\n", + "four_one_outputs = [item.sample.output[0].content for item in four_one]\n", + "four_one_mini_outputs = [item.sample.output[0].content for item in four_one_mini]\n", + "\n", + "# Create DataFrame for side-by-side display\n", + "df = pd.DataFrame({\n", + " \"GPT-4.1 Output\": four_one_outputs,\n", + " \"GPT-4.1-mini Output\": four_one_mini_outputs\n", + "})\n", + "\n", + "display(df)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "for item in four_one:\n", - " print(item.sample.output[0].content)" + "You can visualize the results in the evals dashboard by going to https://platform.openai.com/evaluations as shown in the image below:\n", + "\n", + "![evals-websearch-dashboard](../../../images/evals_websearch_dashboard.png)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we demonstrated a workflow for evaluating the web search capabilities of language models using the OpenAI Evals framework.\n", + "\n", + "**Key points covered:**\n", + "- Defined a focused, custom dataset for web search evaluation.\n", + "- Configured an LLM-based grader for robust assessment.\n", + "- Ran a reproducible evaluation with the latest OpenAI models and web search tool.\n", + "- Retrieved and displayed model outputs for inspection.\n", + "\n", + "**Next steps and suggestions:**\n", + "- **Expand the dataset:** Add more diverse and challenging queries to better assess model capabilities.\n", + "- **Analyze results:** Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses.\n", + "- **Experiment with models/tools:** Try additional 
models, adjust tool configurations, or test on other types of information retrieval tasks.\n", + "- **Automate reporting:** Generate summary tables or plots for easier sharing and decision-making.\n", + "\n", + "For more information, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals)." ] } ], "metadata": { "kernelspec": { - "display_name": "openai", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -189,7 +573,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.9" + "version": "3.11.8" } }, "nbformat": 4, diff --git a/images/evals_websearch_dashboard.png b/images/evals_websearch_dashboard.png new file mode 100644 index 0000000000..ae34fc4c6a Binary files /dev/null and b/images/evals_websearch_dashboard.png differ diff --git a/registry.yaml b/registry.yaml index ac98ad8cc7..26bcd7dc0b 100644 --- a/registry.yaml +++ b/registry.yaml @@ -2167,6 +2167,7 @@ date: 2025-06-09 authors: - josiah-openai + - shikhar-cyber tags: - evals-api - responses
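
As a possible starting point for the "Analyze results" and "Automate reporting" next steps suggested in the notebook's closing section, the sketch below summarizes pass rates for the two runs using the `result_counts` field already printed by `poll_runs`. It is a minimal sketch, not part of the notebook: it assumes the notebook's `client`, `logs_eval`, `gpt_4one_responses_run`, and `gpt_4one_mini_responses_run` variables are in scope and that both runs have finished.

```python
# Minimal sketch: summarize pass/fail rates for both completed eval runs.
# Assumes the notebook's client, logs_eval, and the two run objects exist
# and that poll_runs has already reported both runs as "completed".
for run_id in [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id]:
    run = client.evals.runs.retrieve(run_id, eval_id=logs_eval.id)
    counts = run.result_counts  # e.g. ResultCounts(errored=0, failed=1, passed=9, total=10)
    pass_rate = counts.passed / counts.total if counts.total else 0.0
    print(f"{run.name}: {counts.passed}/{counts.total} passed ({pass_rate:.0%})")
```

The same loop could be extended to build a small pandas DataFrame of per-run results for sharing, in the spirit of the "Automate reporting" suggestion above.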