
Commit 4596343

Responses in evals!
1 parent 81f8df4 commit 4596343

3 files changed: +315 −0 lines


authors.yaml

Lines changed: 5 additions & 0 deletions
@@ -63,6 +63,11 @@ ibigio:
  website: "https://twitter.com/ilanbigio"
  avatar: "https://pbs.twimg.com/profile_images/1841544725654077440/DR3b8DMr_400x400.jpg"

willhath-openai:
  name: "Will Hathaway"
  website: "https://www.willhath.com"
  avatar: "https://media.licdn.com/dms/image/v2/D4E03AQEHOtMrHtww4Q/profile-displayphoto-shrink_200_200/B4EZRR64p9HgAc-/0/1736541178829?e=2147483647&v=beta&t=w1rX0KhLZaK5qBkVLkJjmYmfNMbsV2Bcn8InFVX9lwI"

jhills20:
  name: "James Hills"
  website: "https://twitter.com/jamesmhills"
examples/evaluation/use-cases/responses-evaluation.ipynb

Lines changed: 301 additions & 0 deletions
@@ -0,0 +1,301 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluating a new model on existing responses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following eval, we are going to compare a new model (gpt-4.1-mini) against our old model (gpt-4o-mini) by evaluating it on some stored responses. The benefit of this approach is that most developers won't have to spend any time putting together a whole eval -- all of their data will already be stored in their [logs page](https://platform.openai.com/logs)."
   ]
  },
17+
{
18+
"cell_type": "code",
19+
"execution_count": 30,
20+
"metadata": {},
21+
"outputs": [],
22+
"source": [
23+
"import openai\n",
24+
"import os\n",
25+
"\n",
26+
"\n",
27+
"client = openai.OpenAI()"
28+
]
29+
},
30+
{
31+
"cell_type": "markdown",
32+
"metadata": {},
33+
"source": [
34+
"We want to see how gpt-4.1 compares to gpt-4o on explaining a code base. Since can only use the responses datasource if you already have user traffic, we're going to generate some example traffic using 4o, and then compare how it does to gpt-4.1. \n",
35+
"\n",
36+
"We're going to get some example code files from the OpenAI SDK, and ask gpt-4o to explain them to us."
37+
]
38+
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
    "\n",
    "# Get some example code files from the OpenAI SDK\n",
    "file_paths = [\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
    "]\n",
    "\n",
    "print(file_paths[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's generate some responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for file_path in file_paths:\n",
    "    response = client.responses.create(\n",
    "        input=[\n",
    "            {\"role\": \"user\",\n",
    "             \"content\": [\n",
    "                {\n",
    "                    \"type\": \"input_text\",\n",
    "                    \"text\": \"What does this file do?\"\n",
    "                },\n",
    "                {\n",
    "                    \"type\": \"input_text\",\n",
    "                    \"text\": open(file_path, \"r\").read(),\n",
    "                },\n",
    "             ]},\n",
    "        ],\n",
    "        model=\"gpt-4o-mini\",\n",
    "    )\n",
    "    print(response.output_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that for this to work, you'll have to be on an org where data logging isn't disabled (through ZDR, etc.). If you aren't sure whether this is the case for you, go to https://platform.openai.com/logs?api=responses and check that you can see the responses you just generated."
   ]
  },
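  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, you can also confirm from code that the responses were stored. A minimal sketch: reuse the `response` variable left over from the last loop iteration above and fetch it back by ID. If the retrieval succeeds, the logs-based eval below will have data to run on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: fetch the most recent response back by ID.\n",
    "# If your org has data logging disabled (e.g. ZDR), the response won't be\n",
    "# stored, and the logs-based eval below won't have any data to run on.\n",
    "stored_response = client.responses.retrieve(response.id)\n",
    "print(stored_response.id, stored_response.model)"
   ]
  },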
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "grader_system_prompt = \"\"\"\n",
    "You are **Code-Explanation Grader**, an expert software engineer and technical writer.\n",
    "Your job is to score how well *Model A* explained the purpose and behaviour of a given source-code file.\n",
    "\n",
    "### What you receive\n",
    "1. **File contents** – the full text of the code file (or a representative excerpt).\n",
    "2. **Candidate explanation** – the answer produced by Model A that tries to describe what the file does.\n",
    "\n",
    "### What to produce\n",
    "Return a single JSON object that can be parsed by `json.loads`, containing:\n",
    "```json\n",
    "{\n",
    "  \"steps\": [\n",
    "    { \"description\": \"...\", \"result\": \"float\" },\n",
    "    { \"description\": \"...\", \"result\": \"float\" },\n",
    "    { \"description\": \"...\", \"result\": \"float\" }\n",
    "  ],\n",
    "  \"result\": \"float\"\n",
    "}\n",
    "```\n",
    "• Each object in `steps` documents your reasoning for one category listed under “Scoring dimensions”.\n",
    "• Place your final 1 – 7 quality score (inclusive) in the top-level `result` key as a **string** (e.g. `\"5.5\"`).\n",
    "\n",
    "### Scoring dimensions (evaluate in this order)\n",
    "\n",
    "1. **Correctness & Accuracy ≈ 45 %**\n",
    "   • Does the explanation match the actual code behaviour, interfaces, edge cases, and side effects?\n",
    "   • Fact-check every technical claim; penalise hallucinations or missed key functionality.\n",
    "\n",
    "2. **Completeness & Depth ≈ 25 %**\n",
    "   • Are all major components, classes, functions, data flows, and external dependencies covered?\n",
    "   • Depth should be appropriate to the file’s size/complexity; superficial glosses lose points.\n",
    "\n",
    "3. **Clarity & Organization ≈ 20 %**\n",
    "   • Is the explanation well-structured, logically ordered, and easy for a competent developer to follow?\n",
    "   • Good use of headings, bullet lists, and concise language is rewarded.\n",
    "\n",
    "4. **Insight & Usefulness ≈ 10 %**\n",
    "   • Does the answer add valuable context (e.g., typical use cases, performance notes, risks) beyond line-by-line paraphrase?\n",
    "   • Highlighting **why** design choices matter is a plus.\n",
    "\n",
    "### Error taxonomy\n",
    "• **Major error** – Any statement that materially misrepresents the file (e.g., wrong API purpose, inventing non-existent behaviour).\n",
    "• **Minor error** – Small omission or wording that slightly reduces clarity but doesn’t mislead.\n",
    "List all found errors in your `steps` reasoning.\n",
    "\n",
    "### Numeric rubric\n",
    "1 Catastrophically wrong; mostly hallucination or irrelevant.\n",
    "2 Many major errors, few correct points.\n",
    "3 Several major errors OR pervasive minor mistakes; unreliable.\n",
    "4 Mostly correct but with at least one major gap or multiple minors; usable only with caution.\n",
    "5 Solid, generally correct; minor issues possible but no major flaws.\n",
    "6 Comprehensive, accurate, and clear; only very small nit-picks.\n",
    "7 Exceptional: precise, thorough, insightful, and elegantly presented; hard to improve.\n",
    "\n",
    "Use the full scale. Reserve 6.5 – 7 only when you are almost certain the explanation is outstanding.\n",
    "\n",
    "Then set `\"result\": \"4.0\"` (example).\n",
    "\n",
    "Be rigorous and unbiased.\n",
    "\"\"\"\n",
    "user_input_message = \"\"\"**User input**\n",
    "\n",
    "{{item.input}}\n",
    "\n",
    "**Response to evaluate**\n",
    "\n",
    "{{sample.output_text}}\n",
    "\"\"\""
   ]
  },
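  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the grader's output contract concrete, here is a made-up example of a reply in that format being parsed. The grader itself runs server-side, so this cell is purely illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Hypothetical grader reply, following the format requested in the system prompt above.\n",
    "example_grader_reply = '{\"steps\": [{\"description\": \"Checked correctness\", \"result\": \"5.0\"}], \"result\": \"5.5\"}'\n",
    "parsed = json.loads(example_grader_reply)\n",
    "final_score = float(parsed[\"result\"])  # the prompt asks for the score as a string, so cast it\n",
    "print(final_score)"
   ]
  },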
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "logs_eval = client.evals.create(\n",
    "    name=\"Code QA Eval\",\n",
    "    data_source_config={\n",
    "        \"type\": \"logs\",\n",
    "    },\n",
    "    testing_criteria=[\n",
    "        {\n",
    "            \"type\": \"score_model\",\n",
    "            \"name\": \"General Evaluator\",\n",
    "            \"model\": \"o3\",\n",
    "            \"input\": [\n",
    "                {\n",
    "                    \"role\": \"system\",\n",
    "                    \"content\": grader_system_prompt,\n",
    "                },\n",
    "                {\n",
    "                    \"role\": \"user\",\n",
    "                    \"content\": user_input_message,\n",
    "                },\n",
    "            ],\n",
    "            \"range\": [1, 7],\n",
    "            \"pass_threshold\": 5.5,\n",
    "        }\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's kick off a run to evaluate how good the original responses were. To do this, we just set the filters for which responses we want to evaluate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt_4o_mini_run = client.evals.runs.create(\n",
    "    name=\"gpt-4o-mini\",\n",
    "    eval_id=logs_eval.id,\n",
    "    data_source={\n",
    "        \"type\": \"responses\",\n",
    "        \"source\": {\"type\": \"responses\", \"limit\": len(file_paths)},  # just grab the most recent responses\n",
    "    },\n",
    ")"
   ]
  },
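  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here `limit` is the only filter we need, since the only traffic on this org is the handful of responses we just generated. If your logs contain other traffic, the responses source can be narrowed further; the sketch below uses `model` and `created_after` as examples, but treat those field names as assumptions and check the Evals API reference for the full list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "# Illustrative only -- a more targeted responses source. Field names beyond\n",
    "# \"type\" and \"limit\" are assumptions; consult the API reference before relying on them.\n",
    "filtered_source = {\n",
    "    \"type\": \"responses\",\n",
    "    \"model\": \"gpt-4o-mini\",                     # only grade responses produced by this model\n",
    "    \"created_after\": int(time.time()) - 3600,   # only responses from the last hour\n",
    "    \"limit\": len(file_paths),\n",
    "}"
   ]
  },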
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see how gpt-4.1-mini does!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt_41_mini_run = client.evals.runs.create(\n",
    "    name=\"gpt-4.1-mini\",\n",
    "    eval_id=logs_eval.id,\n",
    "    data_source={\n",
    "        \"type\": \"responses\",\n",
    "        \"source\": {\"type\": \"responses\", \"limit\": len(file_paths)},\n",
    "        \"input_messages\": {\n",
    "            \"type\": \"item_reference\",\n",
    "            \"item_reference\": \"item.input\",\n",
    "        },\n",
    "        \"model\": \"gpt-4.1-mini\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's go to the dashboard to see how we did!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt_4o_mini_run.report_url"
   ]
  },
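  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you'd rather wait for the results inside the notebook instead of refreshing the dashboard, a small polling sketch looks like the following. It assumes the run objects expose `status`, `report_url`, and `result_counts` fields, so adjust the names if your SDK version differs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "# Poll both runs until they finish, then print where to look at the results.\n",
    "for run in (gpt_4o_mini_run, gpt_41_mini_run):\n",
    "    latest = client.evals.runs.retrieve(run.id, eval_id=logs_eval.id)\n",
    "    while latest.status in (\"queued\", \"in_progress\"):\n",
    "        time.sleep(5)\n",
    "        latest = client.evals.runs.retrieve(run.id, eval_id=logs_eval.id)\n",
    "    print(latest.name, latest.status, latest.report_url)\n",
    "    print(latest.result_counts)  # passed / failed / errored counts, if available"
   ]
  },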
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

registry.yaml

Lines changed: 9 additions & 0 deletions
@@ -79,6 +79,15 @@
    - evalsapi
    - completions

- title: EvalsAPI Use-case - Responses Evaluation
  path: examples/evaluation/use-cases/responses-evaluation.ipynb
  date: 2025-05-13
  authors:
    - willhath-openai
  tags:
    - evalsapi
    - responses

- title: Multi-Tool Orchestration with RAG approach using OpenAI's Responses API
  path: examples/responses_api/responses_api_tool_orchestration.ipynb
  date: 2025-03-28
