diff --git a/authors.yaml b/authors.yaml
index 3ce5795e3c..4d0a696afe 100644
--- a/authors.yaml
+++ b/authors.yaml
@@ -3,6 +3,11 @@
# You can optionally customize how your information shows up cookbook.openai.com over here.
# If your information is not present here, it will be pulled from your GitHub profile.
+theophile-openai:
+ name: "Theophile Sautory"
+ website: "https://www.linkedin.com/in/theophilesautory"
+ avatar: "https://avatars.githubusercontent.com/u/206768658?v=4"
+
robert-tinn:
name: "Robert Tinn"
website: "https://www.linkedin.com/in/robert-tinn/"
diff --git a/examples/Reinforcement_Fine_Tuning.ipynb b/examples/Reinforcement_Fine_Tuning.ipynb
new file mode 100644
index 0000000000..9c7a8609e0
--- /dev/null
+++ b/examples/Reinforcement_Fine_Tuning.ipynb
@@ -0,0 +1,2135 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# **Exploring Model Graders for Reinforcement Fine-Tuning**\n",
+ "\n",
+ "*This guide is for developers and ML practitioners who already know their way around OpenAIʼs APIs, have a basic understanding of reinforcement fine-tuning (RFT), and wish to use their fine-tuned models for research or other appropriate uses. OpenAI’s services are not intended for the personalized treatment or diagnosis of any medical condition and are subject to our [applicable terms](https://openai.com/policies/).*\n",
+ "\n",
+ "[Reinforcement fine-tuning (RFT)](https://platform.openai.com/docs/guides/reinforcement-fine-tuning) of reasoning models consists of running reinforcement learning on top of the models to improve their reasoning performance by exploring the solution space and reinforcing strategies that result in a higher reward. RFT helps the model make sharper decisions and interpret context more effectively. \n",
+ "\n",
+ "\n",
+ "In this guide, weʼll walk through how to apply RFT to the OpenAI `o4-mini` reasoning model, using a task from the life sciences research domain: predicting outcomes from doctor-patient transcripts and descriptions, which is a necessary assessment in many health research studies. We'll use a subset of the medical-o1-verifiable-problem [dataset](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem/viewer/default/train?row=0). You will learn the key steps needed to successfully run RFT jobs for your own use-cases.\n",
+ "\n",
+ "Here’s what we’ll cover:\n",
+ "\n",
+ "- **[1. Setup](#1-setup)**\n",
+ "- **[2. Gathering the dataset](#2-gathering-the-dataset)**\n",
+ "- **[3. Benchmarking the base model](#3-benchmarking-the-base-model)**\n",
+ "- **[4. Defining your grader](#4-defining-your-grader)**\n",
+ "- **[5. Training](#5-training)**\n",
+ "- **[6. Using your fine-tuned model](#6-using-your-fine-tuned-model)**\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## **1. Setup**\n",
+ "\n",
+ "Even strong reasoning models can miss the mark when it comes to expert-level behavior, especially in domains like medicine, where nuance and exactness matter. Imagine a model trying to extract [ICD-10](https://www.cms.gov/medicare/coding-billing/icd-10-codes) codes from a transcript: even if it understands the gist, it may not use the precise terminology expected by medical professionals. \n",
+ "\n",
+ "Other great candidates for RFT include topics like ledger normalization or tiering fraud risk: settings in which you want precise, reliable, and repeatable reasoning. Check out our [RFT use-cases guide](https://platform.openai.com/docs/guides/rft-use-cases) for great examples. \n",
+ "\n",
+ "In our case, weʼll focus on teaching `o4-mini` to become better at predicting the outcomes of clinical conversations and descriptions. Specifically, we want to see if RFT can boost the accuracy of the prediction. \n",
+ "\n",
+ "Along the way, weʼll talk about how to write effective graders, how they guide the modelʼs learning, and how to watch out for classic reward-hacking pitfalls. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## **2. Gathering the Dataset**\n",
+ "\n",
+ "Letʼs start off by loading the dataset from Hugging Face. Weʼre interested in samples framed as a description of a patient case with an associated question, followed by the correct answer. These represent real-world transcripts where a physician is summarizing a case and assigning an outcome. For any use-case, verifying the accuracy of the gold-standard answers is critical and requires careful consideration. Here, we will trust the dataset quality."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 116,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Filtered samples: 9169\n"
+ ]
+ }
+ ],
+ "source": [
+ "import re\n",
+ "from datasets import load_dataset\n",
+ "ds = load_dataset(\"FreedomIntelligence/medical-o1-verifiable-problem\")\n",
+ "\n",
+ "def is_age_question(sample):\n",
+ " question = sample.get('Open-ended Verifiable Question', '')\n",
+ " # Match \"A 88-year-old\", \"An 8-year-old\", \"A 23-year-old\", etc. at the start\n",
+ " return re.match(r\"^(A|An) \\d{1,2}-year-old\", question) is not None\n",
+ "\n",
+ "filtered_samples = [s for s in ds[\"train\"] if is_age_question(s)]\n",
+ "print(f\"Filtered samples: {len(filtered_samples)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One of the advantages of RFT is that it doesnʼt need thousands of samples to start making a difference. Thanks to trajectory sampling and the feedback loop during training, the model learns not just correct behaviors, but also patterns to avoid. This means we can see solid gains even with small datasets.\n",
+ "\n",
+ "For this run, weʼll randomly sample 100 training and 100 test examples and slightly normalize them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 82,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Number of training samples: 100\n",
+ "Number of test samples: 100\n"
+ ]
+ }
+ ],
+ "source": [
+ "import random\n",
+ "\n",
+ "# Set a random seed for reproducibility\n",
+ "random.seed(42)\n",
+ "\n",
+ "# Randomly select 100 training samples from filtered_samples\n",
+ "train_samples = random.sample(filtered_samples, min(100, len(filtered_samples)))\n",
+ "\n",
+ "# Remove training samples from filtered_samples to avoid overlap\n",
+ "remaining_samples = [s for s in filtered_samples if s not in train_samples]\n",
+ "\n",
+ "# Randomly select 100 test samples from the remaining samples (no overlap)\n",
+ "test_samples = random.sample(remaining_samples, min(100, len(remaining_samples)))\n",
+ "\n",
+ "print(f\"Number of training samples: {len(train_samples)}\")\n",
+ "print(f\"Number of test samples: {len(test_samples)}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Standardize the 'Ground-True Answer' fields to all lowercase in train and test samples\n",
+ "for sample in train_samples:\n",
+ " if 'Ground-True Answer' in sample and isinstance(sample['Ground-True Answer'], str):\n",
+ " sample['Ground-True Answer'] = sample['Ground-True Answer'].lower()\n",
+ "\n",
+ "for sample in test_samples:\n",
+ " if 'Ground-True Answer' in sample and isinstance(sample['Ground-True Answer'], str):\n",
+ " sample['Ground-True Answer'] = sample['Ground-True Answer'].lower()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We'll convert these samples to `jsonl` format, as expected by the [reinforcement fine-tuning API](https://platform.openai.com/docs/api-reference/fine-tuning/reinforcement-input)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "\n",
+ "def convert_to_jsonl_format(samples, filename):\n",
+ " with open(filename, \"w\") as f:\n",
+ " for sample in samples:\n",
+ " user_content = sample.get(\"Open-ended Verifiable Question\", \"\")\n",
+ " reference_answer = sample.get(\"Ground-True Answer\", \"\")\n",
+ " json_obj = {\n",
+ " \"messages\": [\n",
+ " {\"role\": \"user\", \"content\": user_content}\n",
+ " ],\n",
+ " \"reference_answer\": reference_answer\n",
+ " }\n",
+ " f.write(json.dumps(json_obj) + \"\\n\")\n",
+ "\n",
+ "def load_jsonl(filename):\n",
+ " samples = []\n",
+ " with open(filename, \"r\") as f:\n",
+ " for line in f:\n",
+ " samples.append(json.loads(line))\n",
+ " return samples\n",
+ "\n",
+ "# Save the datasets to jsonl files\n",
+ "convert_to_jsonl_format(train_samples, \"data/medical_01_verifiable_problem_train.jsonl\")\n",
+ "convert_to_jsonl_format(test_samples, \"data/medical_01_verifiable_problem_val.jsonl\")\n",
+ "\n",
+ "# Load the datasets back from jsonl files\n",
+ "train_samples_loaded = load_jsonl(\"data/medical_01_verifiable_problem_train.jsonl\")\n",
+ "test_samples_loaded = load_jsonl(\"data/medical_01_verifiable_problem_val.jsonl\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Next up: we’ll see how the base model performs out of the box, and where there’s room to grow.\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## **3. Benchmarking the Base Model**\n",
+ "\n",
+ "Before we fine-tune anything, we need to know where we’re starting from. Benchmarking gives us a clear picture of the model’s initial strengths and weaknesses, so we can later measure how far it’s come.\n",
+ "\n",
+ "We’ll first lean on two simple yet powerful evaluators:\n",
+ "\n",
+ "1. `clinical_phrase_binary_grader` - an exact-match checker.\n",
+ "2. `clinical_phrase_grader` - a softer, token-based similarity grader."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from rapidfuzz import fuzz, utils\n",
+ "\n",
+ "def clinical_phrase_grader(sample: dict, item: dict) -> float:\n",
+ " from rapidfuzz import fuzz, utils\n",
+ " score = fuzz.token_set_ratio(sample[\"output_text\"], item[\"reference_answer\"], processor=utils.default_process)\n",
+ " return score / 100.0\n",
+ "\n",
+ "def clinical_phrase_binary_grader(sample: dict, item: dict) -> float:\n",
+ " return 1.0 if sample[\"output_text\"] == item[\"reference_answer\"] else 0.0\n",
+ "\n",
+ "def combined_grader(sample: dict, item: dict, weights: list[float] = [0.85, 0.15]) -> float:\n",
+ " clinical_phrase_score = clinical_phrase_grader(sample, item)\n",
+ " binary_score = clinical_phrase_binary_grader(sample, item)\n",
+ " return weights[0] * clinical_phrase_score + weights[1] * binary_score"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This combination lets us track both strict correctness and partial lexical overlap. The binary grader gives a crisp 0 or 1: did the model produce an exact match? The softer one gives more nuance: how close did the output come to the gold answer? We use both because outcomes are often phrased in multiple valid ways. For instance, a model might respond with “gouty arthritis” instead of “gout.” While a human evaluator could consider this partially acceptable, a strict string match would not. Combining exact and fuzzy scoring gives a fairer and more informative assessment of model outputs. \n",
+ "\n",
+ "We build a helper function to prepend a system prompt to each example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def prepend_system_prompt_to_first_user_message(samples, system_prompt, path=None):\n",
+ " new_samples = []\n",
+ " for sample in samples:\n",
+ " # Deep copy to avoid mutating the original\n",
+ " sample_copy = json.loads(json.dumps(sample))\n",
+ " messages = sample_copy.get(\"messages\", [])\n",
+ " if messages and messages[0].get(\"role\") == \"user\" and isinstance(messages[0].get(\"content\"), str):\n",
+ " if not messages[0][\"content\"].startswith(system_prompt):\n",
+ " messages[0][\"content\"] = f\"{system_prompt}\\n\\n{messages[0]['content']}\"\n",
+ " new_samples.append(sample_copy)\n",
+ " if path is not None:\n",
+ " with open(path, \"w\", encoding=\"utf-8\") as f:\n",
+ " for item in new_samples:\n",
+ " f.write(json.dumps(item, ensure_ascii=False) + \"\\n\")\n",
+ " return new_samples"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "simple_prompt = \"\"\"You are an expert clinician. For each clinical vignette, respond with exactly one phrase: the single most likely outcome or phenomenon, all in lowercase. \n",
+ "- Do not add punctuation, articles, explanations, or commentary - output only the term itself.\n",
+ "- Sometimes, the expected answer can be a synonym of what you think.\n",
+ "- Use the standard clinical name (e.g. “thought withdrawal”, “Toxoplasma encephalitis”).\"\"\"\n",
+ "train_samples_loaded_simple_sys_prompt = prepend_system_prompt_to_first_user_message(\n",
+ " train_samples_loaded, simple_prompt, path=\"data/medical_01_verifiable_problem_train_simple_prompt.jsonl\"\n",
+ ")\n",
+ "test_samples_loaded_simple_sys_prompt = prepend_system_prompt_to_first_user_message(\n",
+ " test_samples_loaded, simple_prompt, path=\"data/medical_01_verifiable_problem_val_simple_prompt.jsonl\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Then build a helper function to generate and store the model's predictions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 68,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from openai import OpenAI\n",
+ "import concurrent.futures\n",
+ "from tqdm import tqdm\n",
+ "import os\n",
+ "\n",
+ "client = OpenAI()\n",
+ "\n",
+ "def generate_model_predictions(\n",
+ " subset,\n",
+ " prompt_type,\n",
+ " model_name=\"o4-mini-2025-04-16\",\n",
+ " reasoning_effort=\"medium\",\n",
+ " n_runs=1,\n",
+ " verbose=False,\n",
+ "):\n",
+ " if isinstance(subset, str):\n",
+ " samples_path = f\"data/medical_01_verifiable_problem_{subset}_{prompt_type}_prompt.jsonl\"\n",
+ " with open(samples_path, \"r\", encoding=\"utf-8\") as f:\n",
+ " test_samples = [json.loads(line) for line in f if line.strip()]\n",
+ " else:\n",
+ " test_samples = [subset]\n",
+ "\n",
+ " def run_inference(item):\n",
+ " resp = client.responses.create(\n",
+ " model=model_name,\n",
+ " input=item[\"messages\"],\n",
+ " reasoning={\"effort\": reasoning_effort, \"summary\": \"detailed\"},\n",
+ " )\n",
+ " model_prediction = {'output_text': resp.output_text}\n",
+ " reasoning_tokens_used = resp.usage.output_tokens_details.reasoning_tokens\n",
+ " summaries = [seg.text for item in resp.output if item.type == \"reasoning\" for seg in item.summary]\n",
+ " summaries_string = \"\\n\".join(summaries)\n",
+ " if verbose:\n",
+ " print(\"Prompt: {}\".format(item[\"messages\"][0][\"content\"]))\n",
+ " print(f\"Model Sample: {model_prediction}\\nSolution: {item['reference_answer']}\\n\")\n",
+ " return {\n",
+ " \"model_prediction\": model_prediction[\"output_text\"],\n",
+ " \"input\": item,\n",
+ " \"reasoning_tokens_used\": reasoning_tokens_used,\n",
+ " \"reference_answer\": item[\"reference_answer\"],\n",
+ " \"summaries\": summaries_string\n",
+ " }\n",
+ "\n",
+ " # Ensure the predictions directory exists before any file operations\n",
+ " predictions_dir = os.path.join(\"data\", \"rft\", \"predictions\")\n",
+ " os.makedirs(predictions_dir, exist_ok=True)\n",
+ "\n",
+ " # Check if results already exist for all runs\n",
+ " results_per_run = []\n",
+ " for run_idx in range(n_runs):\n",
+ " run_save_path = os.path.join(\n",
+ " predictions_dir,\n",
+ " f\"{subset}_{prompt_type}_{model_name}_{reasoning_effort}_predictions_run{run_idx+1}.json\"\n",
+ " )\n",
+ " if os.path.exists(run_save_path):\n",
+ " print(f\"Results for run {run_idx+1} already exist at {run_save_path}. Loading results.\")\n",
+ " with open(run_save_path, \"r\", encoding=\"utf-8\") as f:\n",
+ " run_results = json.load(f)\n",
+ " results_per_run.append(run_results)\n",
+ " else:\n",
+ " if len(test_samples) == 1:\n",
+ " run_results = [run_inference(test_samples[0])]\n",
+ " else:\n",
+ " run_results = []\n",
+ " with concurrent.futures.ThreadPoolExecutor() as executor:\n",
+ " futures = [executor.submit(run_inference, item) for item in test_samples]\n",
+ " for future in tqdm(futures, total=len(futures), desc=f\"Generating predictions (run {run_idx+1})\"):\n",
+ " result = future.result()\n",
+ " run_results.append(result)\n",
+ " with open(run_save_path, \"w\", encoding=\"utf-8\") as f:\n",
+ " json.dump(run_results, f, ensure_ascii=False, indent=2)\n",
+ " results_per_run.append(run_results)\n",
+ "\n",
+ " # Return a flat list for backward compatibility\n",
+ " if n_runs == 1:\n",
+ " return results_per_run[0]\n",
+ " else:\n",
+ " return results_per_run"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To generate the predictions, first make sure your API key is set:\n",
+ "\n",
+ "```bash\n",
+ "export OPENAI_API_KEY=...\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# OpenAI o4-mini model\n",
+ "results_simple_o4mini = generate_model_predictions(\n",
+ " subset=\"train\",\n",
+ " prompt_type=\"simple\",\n",
+ " model_name=\"o4-mini\",\n",
+ " reasoning_effort=\"medium\",\n",
+ " n_runs=3\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# OpenAI o3 model\n",
+ "results_simple_o3 = generate_model_predictions(\n",
+ " subset=\"train\",\n",
+ " prompt_type=\"simple\",\n",
+ " model_name=\"o3\",\n",
+ " reasoning_effort=\"medium\",\n",
+ " n_runs=3\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We now have predictions that are ready to be evaluated.\n",
+ "We'll build a helper function that allows us to easily swap in different scoring methods,"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import functools\n",
+ "\n",
+ "def evaluate_predictions_with_grader(\n",
+ " predictions,\n",
+ " grader_func=combined_grader,\n",
+ "):\n",
+ " results = []\n",
+ "\n",
+ " if isinstance(predictions, dict):\n",
+ " predictions = [predictions]\n",
+ "\n",
+ " def run_grading(pred):\n",
+ " model_prediction = {\"output_text\": pred[\"model_prediction\"]}\n",
+ " item = pred[\"input\"]\n",
+ " score = grader_func(model_prediction, item)\n",
+ " result = pred.copy()\n",
+ " result[\"score\"] = score\n",
+ " return result\n",
+ "\n",
+ " if len(predictions) == 1:\n",
+ " result = run_grading(predictions[0])\n",
+ " results.append(result)\n",
+ " else:\n",
+ " with concurrent.futures.ThreadPoolExecutor() as executor:\n",
+ " futures = [executor.submit(run_grading, pred) for pred in predictions]\n",
+ " for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc=\"Grading predictions\"):\n",
+ " results.append(future.result())\n",
+ "\n",
+ " total = len(results)\n",
+ " correct = sum(r[\"score\"] for r in results)\n",
+ " accuracy = correct / total if total else 0.0\n",
+ "\n",
+ " metrics = {\n",
+ " \"total_samples\": total,\n",
+ " \"accuracy\": accuracy,\n",
+ " }\n",
+ " print(metrics)\n",
+ " return metrics, results\n",
+ "\n",
+ "def run_prediction_evaluation(\n",
+ " model_name=\"o4-mini\",\n",
+ " reasoning_effort=\"medium\",\n",
+ " prompt_type=\"simple\",\n",
+ " subset=\"train\",\n",
+ " grader_func=combined_grader,\n",
+ " num_runs=3,\n",
+ "):\n",
+ " if isinstance(grader_func, functools.partial):\n",
+ " name = grader_func.func.__name__\n",
+ " mg = grader_func.keywords[\"model_grader\"]\n",
+ " mg_name = mg[\"name\"]\n",
+ " name = f\"{name}_{mg_name}\"\n",
+ " else:\n",
+ " name = getattr(grader_func, \"__name__\", getattr(grader_func, \"__class__\", type(grader_func)).__name__)\n",
+ " grader_func_name = name.replace(\" \", \"_\").replace(\":\", \"_\").replace(\"/\", \"_\").replace(\",\", \"_\")\n",
+ "\n",
+ " for i in range(num_runs):\n",
+ " preds_path = f\"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_predictions_run{i+1}.json\"\n",
+ " with open(preds_path, \"r\") as f:\n",
+ " preds = json.load(f)\n",
+ " metrics, results_with_scores = evaluate_predictions_with_grader(preds, grader_func=grader_func)\n",
+ " # Save the scored results\n",
+ " with open(f\"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_scored.json\", \"w\") as f:\n",
+ " json.dump(results_with_scores, f, indent=2)\n",
+ " # Save the metrics\n",
+ " with open(f\"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_metrics.json\", \"w\") as f:\n",
+ " json.dump(metrics, f, indent=2)\n",
+ " # Save the scores (if present in results_with_scores)\n",
+ " scores = [item.get(\"score\") for item in results_with_scores if \"score\" in item]\n",
+ " with open(f\"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_scores.json\", \"w\") as f:\n",
+ " json.dump(scores, f, indent=2)\n",
+ "\n",
+ "def load_predictions(\n",
+ " model_name=\"o4-mini\",\n",
+ " reasoning_effort=\"medium\",\n",
+ " prompt_type=\"simple\",\n",
+ " subset=\"train\",\n",
+ " grader_func_name=\"clinical_phrase_grader\",\n",
+ " num_runs=3\n",
+ "):\n",
+ " all_predictions = []\n",
+ " all_metrics = []\n",
+ " for run in range(1, num_runs + 1):\n",
+ " pred_path = f\"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{run}_scored.json\"\n",
+ " metrics_path = f\"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{run}_metrics.json\"\n",
+ " try:\n",
+ " with open(pred_path, \"r\") as f:\n",
+ " predictions = json.load(f)\n",
+ " except FileNotFoundError:\n",
+ " predictions = None\n",
+ " try:\n",
+ " with open(metrics_path, \"r\") as f:\n",
+ " metrics = json.load(f)\n",
+ " except FileNotFoundError:\n",
+ " metrics = None\n",
+ " all_predictions.append(predictions)\n",
+ " all_metrics.append(metrics)\n",
+ " return all_predictions, all_metrics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "and then run the evaluations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 103,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 329740.88it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.5716752010712578}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 497544.96it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.5855097792577905}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 414456.92it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.5702082734545793}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "model_name = \"o4-mini\"\n",
+ "reasoning_effort = \"medium\"\n",
+ "prompt_type = \"simple\"\n",
+ "subset = \"train\"\n",
+ "grader_func = combined_grader\n",
+ "grader_func_name = \"combined_grader\"\n",
+ "num_runs = 3\n",
+ "run_prediction_evaluation(\n",
+ " model_name=model_name, \n",
+ " reasoning_effort=reasoning_effort, \n",
+ " prompt_type=prompt_type, \n",
+ " subset=subset, \n",
+ " grader_func=grader_func, \n",
+ " num_runs=num_runs\n",
+ ")\n",
+ "predictions_o4mini_medium_simple_prompt, metrics_o4mini_medium_simple_prompt = load_predictions(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func_name=grader_func_name, num_runs=num_runs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Visualizing the results allows us to spot trends and failure modes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 115,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Total mistakes: 84\n",
+ "\n",
+ "[Sample 16]\n",
+ " Model prediction: enveloped double stranded linear dna virus\n",
+ " Reference answer: double-stranded, enveloped dna virus\n",
+ " Score: 0.85\n",
+ "\n",
+ "[Sample 19]\n",
+ " Model prediction: gallstone ileus\n",
+ " Reference answer: gall stone ileus\n",
+ " Score: 0.8225806451612904\n",
+ "\n",
+ "[Sample 20]\n",
+ " Model prediction: acute rheumatic fever\n",
+ " Reference answer: postinfectious glomerulonephritis\n",
+ " Score: 0.22037037037037036\n",
+ "\n",
+ "[Sample 22]\n",
+ " Model prediction: amygdala\n",
+ " Reference answer: hippocampus\n",
+ " Score: 0.17894736842105263\n",
+ "\n",
+ "[Sample 23]\n",
+ " Model prediction: hypopituitarism\n",
+ " Reference answer: pituitary adenoma\n",
+ " Score: 0.47812499999999997\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Print mistakes where the model did not get the correct answer (score < 1.0)\n",
+ "mistakes = [\n",
+ " {\"index\": i, **res}\n",
+ " for i, res in enumerate(predictions_o4mini_medium_simple_prompt[0])\n",
+ " if res[\"score\"] < 1.0\n",
+ "]\n",
+ "\n",
+ "print(f\"\\nTotal mistakes: {len(mistakes)}\")\n",
+ "for m in mistakes[15:20]:\n",
+ " print(f\"\\n[Sample {m['index']}]\")\n",
+ " print(f\" Model prediction: {m['model_prediction']}\")\n",
+ " print(f\" Reference answer: {m['reference_answer']}\")\n",
+ " print(f\" Score: {m['score']}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As observed above, typical failure modes fall into three categories:\n",
+ "1. Small differences and formatting issues, score >= 0.8.\n",
+ "2. Partial lexical match, 0.3 <= score < 0.8.\n",
+ "3. Lexically off-base, score < 0.3.\n",
+ "\n",
+ "We can visualize the full score distribution on the training set.\n",
+ "\n",
+ "> **Note:** In practice, analyzing model errors at scale often involves a mix of manual review and automated methods, such as tagging failure types or clustering predictions by score and content. That workflow is beyond the scope of this guide, but it's a valuable next step once you've identified broad patterns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 84,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "scores_distribution = [m['score'] for m in predictions_o4mini_medium_simple_prompt[0]]\n",
+ "plt.hist(scores_distribution, alpha=0.6, label='o4-mini medium simple prompt')\n",
+ "plt.legend()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's compare with other models and prompts, and visualize scores."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 104,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 489988.79it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.6150339441350683}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 507170.98it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.5901906182115139}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 543303.63it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.5927679005876193}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# OpenAI o3 model\n",
+ "model_name = \"o3\"\n",
+ "reasoning_effort = \"medium\"\n",
+ "prompt_type = \"simple\"\n",
+ "subset = \"train\"\n",
+ "grader_func = combined_grader\n",
+ "grader_func_name = \"combined_grader\"\n",
+ "num_runs = 3\n",
+ "run_prediction_evaluation(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func=grader_func, num_runs=num_runs)\n",
+ "predictions_o3_medium_simple_prompt, metrics_o3_medium_simple_prompt = load_predictions(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func_name=grader_func_name, num_runs=num_runs)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 106,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import seaborn as sns\n",
+ "\n",
+ "def average_and_std_metrics(metrics_list):\n",
+ " \"\"\"Returns dicts of mean and std for a list of metrics dicts.\"\"\"\n",
+ " if not metrics_list: return {}, {}\n",
+ " keys = metrics_list[0].keys()\n",
+ " arr = {k: np.array([m[k] for m in metrics_list]) for k in keys}\n",
+ " mean = {k: float(np.mean(arr[k])) for k in keys}\n",
+ " std = {k: float(np.std(arr[k])) for k in keys}\n",
+ " return mean, std\n",
+ "\n",
+ "def plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title=\"Combined Grader Accuracy\", sharey: bool = True) -> None:\n",
+ "    \"\"\"Plots model accuracies with standard deviation error bars.\"\"\"\n",
+ "    # Convert the nested dicts into tidy DataFrames\n",
+ "    df_avg = pd.DataFrame(model_metrics_avg).T.reset_index().rename(columns={\"index\": \"Model\"})\n",
+ "    df_std = pd.DataFrame(model_metrics_std).T.reset_index().rename(columns={\"index\": \"Model\"})\n",
+ "\n",
+ "    # Long-form for Seaborn\n",
+ "    long_df_avg = df_avg.melt(id_vars=\"Model\", value_vars=[\"accuracy\"], var_name=\"Metric\", value_name=\"Accuracy\")\n",
+ "    long_df_std = df_std.melt(id_vars=\"Model\", value_vars=[\"accuracy\"], var_name=\"Metric\", value_name=\"Std\")\n",
+ "\n",
+ "    # Merge avg and std for error bars\n",
+ "    long_df = pd.merge(long_df_avg, long_df_std, on=[\"Model\", \"Metric\"])\n",
+ "\n",
+ "    pretty_names = {\"accuracy\": grader_title}\n",
+ "\n",
+ "    # Create a separate figure for each metric\n",
+ "    for metric_key in [\"accuracy\"]:\n",
+ "        metric_df = long_df[long_df[\"Metric\"] == metric_key].copy()\n",
+ "        plt.figure(figsize=(8, 5))\n",
+ "        # Plot bars without Seaborn's built-in error bars\n",
+ "        ax = sns.barplot(data=metric_df, x=\"Model\", y=\"Accuracy\", hue=\"Model\", palette=\"tab10\", legend=False, errorbar=None)\n",
+ "        bars = ax.patches\n",
+ "        # Add error bars manually\n",
+ "        for i, row in enumerate(metric_df.itertuples()):\n",
+ "            bar = bars[i]\n",
+ "            x = bar.get_x() + bar.get_width() / 2\n",
+ "            y = row.Accuracy\n",
+ "            yerr = row.Std\n",
+ "            ax.errorbar(x=x, y=y, yerr=yerr, fmt='none', ecolor='black', capsize=5, elinewidth=2, capthick=2, zorder=10)\n",
+ "        plt.title(pretty_names[metric_key])\n",
+ "        plt.ylabel(\"Accuracy\")\n",
+ "        plt.xlabel(\"\")\n",
+ "        if sharey:\n",
+ "            plt.ylim(0, 1)\n",
+ "        # Annotate bars with exact values\n",
+ "        for bar in bars:\n",
+ "            height = bar.get_height()\n",
+ "            ax.annotate(f\"{height:.2f}\", xy=(bar.get_x() + bar.get_width() / 2, height), xytext=(0, 6), textcoords=\"offset points\", ha='center', va='bottom', fontsize=10, fontweight='bold')\n",
+ "        plt.xticks(rotation=15, ha=\"right\")\n",
+ "        plt.tight_layout()\n",
+ "        plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 107,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "avg_metrics_o4mini_medium_simple_prompt, std_metrics_o4mini_medium_simple_prompt = average_and_std_metrics(metrics_o4mini_medium_simple_prompt)\n",
+ "avg_metrics_o3_medium_simple_prompt, std_metrics_o3_medium_simple_prompt = average_and_std_metrics(metrics_o3_medium_simple_prompt)\n",
+ "model_metrics_avg = {\n",
+ " \"o4-mini-medium-simple-prompt\": avg_metrics_o4mini_medium_simple_prompt,\n",
+ " \"o3-medium-simple-prompt\": avg_metrics_o3_medium_simple_prompt,\n",
+ "}\n",
+ "model_metrics_std = {\n",
+ " \"o4-mini-medium-simple-prompt\": std_metrics_o4mini_medium_simple_prompt,\n",
+ " \"o3-medium-simple-prompt\": std_metrics_o3_medium_simple_prompt,\n",
+ "}\n",
+ "plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title=\"Combined Grader Accuracy\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that the modelʼs performance has clear limits. In practice, iterating on the prompt often helps boost baseline results and get more out of the base model. However, in this case, our prompt engineering didnʼt lead to meaningful improvements, so we excluded those runs from the analysis.\n",
+ "\n",
+ "\n",
+ "A key requirement for RFT to work is that the base model demonstrates it can successfully complete the task for at least some examples right out of the gate. The initial accuracy of ~0.6 is a strong signal that RFT can boost performance. If the model never succeeds on your tasks, there is no training signal to hill climb on.\n",
+ "\n",
+ "\n",
+ "This evaluation process prepares us for the next step: guiding the model with structured, high-quality feedback from a grader.\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## **4. Defining Your Grader**\n",
+ "\n",
+ "The grader defines the reward function that shapes model behavior during RFT: it rewards desired outputs and penalizes undesirable ones. Designing an effective grader requires both principled structure and thoughtful domain insight, and is perhaps the most important task for successful RFT.\n",
+ "\n",
+ "In this section, we present three graders, show how to set them up to fit the API, and discuss the results they yielded. We then show how to launch an RFT job."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### String-based grader\n",
+ "We began with a dual grader built from our earlier evaluation functions, since it yields a distribution of scores aligned with the lexical proximity of the prediction to the reference answer. It provided a starting point, but the signal wasnʼt rich enough for `o4-mini` to truly learn and improve, and a first experiment showed stagnant reward during the RFT run. To submit a Python grader through the API, build the grading function payload as shown below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import inspect\n",
+ "\n",
+ "# --- Utility functions ---\n",
+ "def build_python_grader_payload(grader_fn):\n",
+ "    \"\"\"Build a payload for a python grader.\"\"\"\n",
+ "    grader_source = inspect.getsource(grader_fn)\n",
+ "    # Enforce function name to be `grade`\n",
+ "    grader_source = grader_source.replace(grader_fn.__name__, \"grade\", 1)\n",
+ "    return {\n",
+ "        \"type\": \"python\",\n",
+ "        \"source\": grader_source,\n",
+ "    }\n",
+ "\n",
+ "multi_python_grader_tool_call = {\n",
+ "    \"type\": \"multi\",\n",
+ "    \"graders\": {\n",
+ "        \"clinical_phrase\": {\n",
+ "            \"name\": \"clinical_phrase_grader\",\n",
+ "            \"image_tag\": \"2025-05-08\",\n",
+ "            **build_python_grader_payload(clinical_phrase_grader),\n",
+ "        },\n",
+ "        \"clinical_phrase_binary\": {\n",
+ "            \"name\": \"clinical_phrase_binary_grader\",\n",
+ "            \"image_tag\": \"2025-05-08\",\n",
+ "            **build_python_grader_payload(clinical_phrase_binary_grader),\n",
+ "        },\n",
+ "    },\n",
+ "    \"calculate_output\": \"0.85 * clinical_phrase + 0.15 * clinical_phrase_binary\",\n",
+ "}"
+ ]
+ },
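+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a worked illustration of the `calculate_output` expression, take two hypothetical sub-scores (made-up values, not results from our runs, and assuming both sub-scores lie in $[0, 1]$): `clinical_phrase = 0.60` and `clinical_phrase_binary = 0`. The combined reward would then be\n",
+ "\n",
+ "$$0.85 \\times 0.60 + 0.15 \\times 0 = 0.51$$\n",
+ "\n",
+ "With these weights the continuous similarity score dominates, and the binary component can contribute at most $0.15$ to the reward."
+ ]
+ },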
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is a snapshot of its training curves, where the green curve is the training set reward and the blue curve is the test set reward:\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Model Grader 1\n",
+ "To address this limitation, we introduced a more advanced approach: the **model grader**. A model-based grader lets us embed semantic understanding and nuance into the feedback. Thatʼs especially powerful when domain-specific synonyms or fuzzy reasoning are in play. \n",
+ "\n",
+ "We used `gpt-4.1` as our grader model, guided by a rubric that emphasized semantic fidelity: clinical synonymy, correct disease categorization, and conceptual alignment. Rather than focusing on superficial phrasing (e.g., \"Is this the same string?\"), the grader aimed to answer, \"Does this reflect the correct outcome or phenomenon?\"\n",
+ "\n",
+ "To ensure the grader aligned with expert expectations, we evaluated it on a subset of base model predictions. For any production use case, domain expert reviewers should verify that model-assigned scores reflect preferred answer orderings and align with domain judgment. This typically involves confirming that the model grader correctly ranks predictions according to their validity. In the scope of this cookbook, we approximated this evaluation by using OpenAI `o3` to check whether higher-quality predictions were consistently rewarded relative to their alternatives.\n",
+ "\n",
+ "Based on this feedback from `o3`, we iteratively updated the model grader until the results were aligned."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "GRADER_PROMPT_1 = \"\"\"\n",
+ "System:\n",
+ " You are an expert medical grader. Compare the **Reference Answer** to the **Model's Answer** and produce **only** a JSON object with:\n",
+ " • **result**: a float between 0.0 and 1.0 \n",
+ " • **steps**: a list of reasoning steps (each with a `\"description\"` and a `\"conclusion\"`)\n",
+ "\n",
+ " Scoring rubric (start at 0.0, then add or subtract):\n",
+ " 1. Exact lexical match: **+0.15** \n",
+ " 2. Clinical synonym (e.g. “withdrawal of thought” ↔ “thought withdrawal”): **+0.35** \n",
+ " 3. Same disease family (e.g. two viral encephalitides): **+0.35** \n",
+ " 4. Partial term overlap (e.g. “ulcer” in both phrases): **+0.15** \n",
+ " 5. Completely unrelated: **-0.10**\n",
+ "\n",
+ " • If multiple criteria apply, sum their weights (max 1.0). \n",
+ " • Cap the final score to the [0.0, 1.0] range. \n",
+ " • In your **steps**, show which rule you applied and the running subtotal.\n",
+ "\"\"\""
+ ]
+ },
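+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because the rubric in `GRADER_PROMPT_1` is additive, a prediction that satisfies several criteria accumulates credit. As a hypothetical illustration (not a case taken from our data), a prediction judged to be a clinical synonym of the reference that also shares a term with it would score\n",
+ "\n",
+ "$$0.35 + 0.15 = 0.50$$\n",
+ "\n",
+ "whereas a prediction judged completely unrelated would receive $-0.10$, which the cap to the $[0.0, 1.0]$ range then raises to $0.0$."
+ ]
+ },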
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is how to build the grader dictionary for submission through the API."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_grader_1 = {\n",
+ " \"type\": \"score_model\",\n",
+ " \"name\": \"gpt41_score_model_1\",\n",
+ " \"input\": [\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": GRADER_PROMPT_1\n",
+ " },\n",
+ " {\n",
+ " \"role\": \"user\",\n",
+ " \"content\": \"Reference Answer: {{item.reference_answer}}. Model's Answer: {{sample.output_text}}\"\n",
+ " }\n",
+ " ],\n",
+ " \"pass_threshold\": 0.75,\n",
+ " \"model\": \"gpt-4.1-2025-04-14\",\n",
+ " \"range\": [0, 1],\n",
+ " \"sampling_params\": {\n",
+ " \"seed\": 42,\n",
+ " \"temperature\": 0,\n",
+ " },\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We also set up the model grader locally, so that we can score the predictions of the models we fine-tune next."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "from pydantic import BaseModel\n",
+ "from typing import List\n",
+ "\n",
+ "class GraderStep(BaseModel):\n",
+ "    description: str\n",
+ "    conclusion: str\n",
+ "\n",
+ "class GraderResponse(BaseModel):\n",
+ "    result: float\n",
+ "    steps: List[GraderStep]\n",
+ "\n",
+ "# Adapted python_model_grader to match the other graders' interface\n",
+ "def python_model_grader(sample, item, model_grader=model_grader_1):\n",
+ "    \"\"\"\n",
+ "    Calls an OpenAI model to grade the model output against the reference answer.\n",
+ "    Expects sample to have \"output_text\", item to have \"reference_answer\".\n",
+ "    Returns a float score (parsed from the model's JSON response).\n",
+ "    \"\"\"\n",
+ "    # Prepare the prompt as the grader expects\n",
+ "    system_prompt = model_grader[\"input\"][0][\"content\"]\n",
+ "    user_prompt = model_grader[\"input\"][1][\"content\"]\n",
+ "    # Fill the template placeholders with the reference and the prediction\n",
+ "    user_prompt_filled = (\n",
+ "        user_prompt\n",
+ "        .replace(\"{{item.reference_answer}}\", item[\"reference_answer\"])\n",
+ "        .replace(\"{{sample.output_text}}\", sample[\"output_text\"])\n",
+ "    )\n",
+ "    messages = [\n",
+ "        {\"role\": \"system\", \"content\": system_prompt},\n",
+ "        {\"role\": \"user\", \"content\": user_prompt_filled}\n",
+ "    ]\n",
+ "    # Call the OpenAI API with the grader's model\n",
+ "    response = client.beta.chat.completions.parse(\n",
+ "        model=model_grader[\"model\"],\n",
+ "        messages=messages,\n",
+ "        seed=model_grader.get(\"sampling_params\", {}).get(\"seed\", None),\n",
+ "        temperature=model_grader.get(\"sampling_params\", {}).get(\"temperature\", 0),\n",
+ "        response_format=GraderResponse,\n",
+ "    )\n",
+ "    # Parse the float score from the model's JSON response\n",
+ "    parsed = response.choices[0].message.parsed\n",
+ "    if not isinstance(parsed, GraderResponse):\n",
+ "        raise RuntimeError(f\"Grader returned invalid structured output: {parsed!r}\")\n",
+ "    return float(parsed.result)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "While the rubric initially delivered sensible feedback, the model soon uncovered a loophole and began **reward-hacking**. Scores shot up, sometimes by 20-30 percentage points, not because clinical accuracy improved but because the model padded its “one phrase” answers with synonyms, doses, and full management plans. You might see `begin warfarin therapy **and** continue unfractionated heparin for ≥5 days, overlapping until the INR is in the therapeutic range (2–3)` or `chewable aspirin 325 mg stat plus nitroglycerin…` instead of the required `continue unfractionated heparin` or `aspirin`, respectively. Although the system prompt is explicit (*“respond with exactly one phrase: the single most likely outcome or phenomenon”*), these verbose outputs inflate *lexical_similarity* scores without adding real predictive value. This experience highlights the need to continuously inspect model outputs and remain vigilant for reward-hacking behaviors that can quietly distort evaluation metrics."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is a snapshot of its training curves (green is training reward, blue is test reward):\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Model Grader 2\n",
+ "To mitigate this reward hacking, we refined the grader prompt by clarifying expectations, enforcing stricter output constraints, and supplying contrastive examples of correct versus incorrect behavior. Once again, we iterated with `o3`, leveraging predictions from the base `o4-mini` and reward-hacking examples from the previous fine-tuned model, to design and validate our grader. Another important change in this updated grader is the reduced weight of *lexical_similarity*, which ensures that *clinical_similarity* prevails."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 91,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "GRADER_PROMPT_2 = \"\"\"You are an expert medical grader.\n",
+ "\n",
+ "Compare the reference_answer (gold standard) with the model_prediction\n",
+ "and return **exactly** this JSON object:\n",
+ "\n",
+ "{\n",
+ " \"steps\": [ // each: {\"description\": \"...\", \"conclusion\": \"...\"}\n",
+ " …\n",
+ " ],\n",
+ " \"result\": \n",
+ "}\n",
+ "\n",
+ "──────────────── Input placeholders ───────────────\n",
+ "reference_answer:\n",
+ "model_prediction:\n",
+ "\n",
+ "──────────── Normalisation steps ────────────\n",
+ "• lowercase, strip punctuation / excess whitespace \n",
+ "• expand common abbreviations (e.g. cll → chronic lymphocytic leukemia) \n",
+ "• map both strings to ICD-10 / SNOMED concepts when possible\n",
+ "\n",
+ "──────────── Clinical layer rubric ───────────\n",
+ "L1 exact concept or universally accepted synonym \n",
+ "L2 same concept but benign modifier differs (e.g. “acute”, “left”) \n",
+ "L3 same disease / drug family but wrong subtype or variant \n",
+ "L4 same organ system but entirely different disease / intervention \n",
+ "L5 only partial mechanistic overlap (e.g. both vasodilators) \n",
+ "L6 unrelated or nonsensical\n",
+ "\n",
+ "──────────── Scoring parameters ─────────────\n",
+ "clinical_weight = 0.90\n",
+ "lexical_weight = 0.10\n",
+ "clinical_similarity = {1:1.00, 2:0.85, 3:0.45, 4:0.30, 5:0.10, 6:0.00}\n",
+ "\n",
+ "lexical_similarity = normalized_levenshtein(reference_answer,\n",
+ " model_prediction)\n",
+ "\n",
+ "# Optional penalty if a clinically critical adjective is missing\n",
+ "critical_modifiers = [\n",
+ " \"wide\", \"narrow\", \"acute\", \"chronic\", \"posteromedial\",\n",
+ " \"oxidized\", \"oxidised\", \"left\", \"right\"\n",
+ "]\n",
+ "modifier_pen = -0.05 if any(\n",
+ " w in reference_answer and w not in model_prediction\n",
+ " for w in critical_modifiers\n",
+ ") else 0.0\n",
+ "\n",
+ "# Determine layer L (1-6) per rubric above using ontology + judgment.\n",
+ "if L == 6:\n",
+ " score = 0.0\n",
+ "else:\n",
+ " score = (clinical_weight * clinical_similarity[L] +\n",
+ " lexical_weight * lexical_similarity) + modifier_pen\n",
+ "\n",
+ "Clamp to [0,1] and round to 3 decimals. \n",
+ "Output **only** the JSON.\n",
+ "\n",
+ "──────────────── Worked examples ─────────────\n",
+ "reference_answer: beta-thalassemia major \n",
+ "model_prediction: beta-thalassemia minor \n",
+ "reasoning: Both involve β-globin chain synthesis, but “major” causes\n",
+ " transfusion-dependent anemia while “minor” is largely benign;\n",
+ " same family, wrong subtype → **L3**. Lexical ≈ 0.83. \n",
+ "score = 0.90·0.45 + 0.10·0.83 = 0.488 → **0.488**\n",
+ "\n",
+ "reference_answer: ACE inhibitor \n",
+ "model_prediction: angiotensin-receptor blocker \n",
+ "reasoning: Both act on the renin–angiotensin axis yet on different\n",
+ " targets; only partial mechanistic overlap → **L5**.\n",
+ " Lexical ≈ 0.31. \n",
+ "score = 0.90·0.10 + 0.10·0.31 = 0.121 → **0.121**\n",
+ "\n",
+ "reference_answer: acute pancreatitis \n",
+ "model_prediction: pancreatitis \n",
+ "reasoning: Same disorder but missing timing adjective “acute”;\n",
+ " benign modifier difference → **L2**. Lexical ≈ 0.78. \n",
+ "score = 0.90·0.85 + 0.10·0.78 = 0.843 → **0.843**\n",
+ "\n",
+ "reference_answer: valproate \n",
+ "model_prediction: valproic acid \n",
+ "reasoning: Valproic acid is the active moiety of valproate; mechanisms\n",
+ " and indications are identical → **L1**. Lexical ≈ 0.82. \n",
+ "score = 0.90·1.00 + 0.10·0.82 = 0.982 → **0.982**\n",
+ "\n",
+ "reference_answer: riboflavin \n",
+ "model_prediction: riboflavin deficiency \n",
+ "reasoning: Adds “deficiency” but refers to the same vitamin (B₂);\n",
+ " benign modifier difference → **L2**. Lexical ≈ 0.60. \n",
+ "score = 0.90·0.85 + 0.10·0.60 = 0.825 → **0.825**\n",
+ "\n",
+ "reference_answer: splenectomy \n",
+ "model_prediction: acetaminophen overdose \n",
+ "reasoning: Surgical removal of the spleen has no mechanistic or anatomic\n",
+ " relationship to toxic drug ingestion → **L6**. \n",
+ "score = **0.000**\n",
+ "\n",
+ "reference_answer: ulcerative colitis \n",
+ "model_prediction: Crohn disease \n",
+ "reasoning: Both are inflammatory-bowel diseases but differ in location,\n",
+ " histology and management; same organ system, different disease\n",
+ " → **L4**. Lexical ≈ 0.38. \n",
+ "score = 0.90·0.30 + 0.10·0.38 = 0.308 → **0.308**\"\"\""
+ ]
+ },
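+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The scoring logic embedded in `GRADER_PROMPT_2` can be summarized in a single formula. The symbols are shorthand introduced for this note only: $c_L$ is the `clinical_similarity` value of the assigned layer $L$, $\\ell$ is `lexical_similarity`, and $p$ is `modifier_pen` (either $-0.05$ or $0$). For layers 1 to 5,\n",
+ "\n",
+ "$$\\text{score} = \\min\\left(1, \\max\\left(0, 0.90 \\cdot c_L + 0.10 \\cdot \\ell + p\\right)\\right)$$\n",
+ "\n",
+ "and the score is $0$ for layer 6. For the first worked example in the prompt (beta-thalassemia major versus minor, $L = 3$, $\\ell \\approx 0.83$, no penalty), this gives $0.90 \\times 0.45 + 0.10 \\times 0.83 = 0.405 + 0.083 = 0.488$."
+ ]
+ },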
+ {
+ "cell_type": "code",
+ "execution_count": 92,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_grader_2 = {\n",
+ " \"type\": \"score_model\",\n",
+ " \"name\": \"gpt41_score_model_2\",\n",
+ " \"input\": [\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": GRADER_PROMPT_2\n",
+ " },\n",
+ " {\n",
+ " \"role\": \"user\",\n",
+ " \"content\": \"Reference Answer: {{item.reference_answer}}. Model's Answer: {{sample.output_text}}\"\n",
+ " }\n",
+ " ],\n",
+ " \"pass_threshold\": 0.75,\n",
+ " \"model\": \"gpt-4.1-2025-04-14\",\n",
+ " \"range\": [0, 1],\n",
+ " \"sampling_params\": {\n",
+ " \"seed\": 42,\n",
+ " \"temperature\": 0,\n",
+ " },\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "The final result was a high-signal, domain-sensitive grader that guided the model toward more appropriate and concise predictions.\n",
+ "\n",
+ "**Note on cost:** LLM graders incur token usage charges in addition to training compute. To manage costs effectively, we recommend:\n",
+ "1. Testing your grader locally on base model completions (and optionally synthetic ones) to ensure it aligns with your rubric or human preferences. When available, use [flex processing](https://platform.openai.com/docs/guides/flex-processing) for more efficient evaluation.\n",
+ "2. Starting with a small-scale RFT run to validate grader alignment and detect potential reward-hacking before scaling up.\n",
+ "\n",
+ "Let's look at how to launch the training in the next step!\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## **5. Training**\n",
+ "\n",
+ "Once your prompt and grader are finalized, you can proceed to training. This section shows how to launch RFT using your final grader. Naturally, you would have already run similar commands when experimenting with earlier grader versions to evaluate their performance."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We first make sure the grader passes the API validation check,"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "\n",
+ "API_KEY = os.environ[\"OPENAI_API_KEY\"]\n",
+ "HEADERS = {\"Authorization\": f\"Bearer {API_KEY}\"}\n",
+ "\n",
+ "# Validate a grader configuration for fine-tuning\n",
+ "payload = {\"grader\": model_grader_2}\n",
+ "try:\n",
+ "    response = requests.post(\n",
+ "        \"https://api.openai.com/v1/fine_tuning/alpha/graders/validate\",\n",
+ "        json=payload,\n",
+ "        headers=HEADERS,\n",
+ "    )\n",
+ "    response.raise_for_status()\n",
+ "    print(\"Grader validated\")\n",
+ "except requests.exceptions.RequestException as e:\n",
+ "    print(f\"Error validating grader: {e}\")\n",
+ "    # e.response is only set when the server actually replied;\n",
+ "    # this avoids printing a stale `response` left over from an earlier cell\n",
+ "    if e.response is not None:\n",
+ "        print(f\"Response: {e.response.text}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "and upload the training and test sets to the OpenAI file system."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Set your training and test file paths\n",
+ "train_file = \"data/medical_01_verifiable_problem_train_with_prompt.jsonl\"\n",
+ "test_file = \"data/medical_01_verifiable_problem_val_with_prompt.jsonl\"\n",
+ "\n",
+ "def upload_file(file_path: str) -> str:\n",
+ "    \"\"\"Upload a file to the OpenAI platform for fine-tuning.\"\"\"\n",
+ "    print(f\"Uploading file: {file_path}\")\n",
+ "    with open(file_path, 'rb') as f:\n",
+ "        response = requests.post(\n",
+ "            \"https://api.openai.com/v1/files\",\n",
+ "            headers=HEADERS,\n",
+ "            files={\"file\": f},\n",
+ "            data={\"purpose\": \"fine-tune\"}\n",
+ "        )\n",
+ "        response.raise_for_status()\n",
+ "        file_id = response.json()[\"id\"]\n",
+ "        print(f\"File uploaded successfully. File ID: {file_id}\")\n",
+ "        return file_id\n",
+ "\n",
+ "# Upload local .jsonl files; anything else is assumed to already be a file ID\n",
+ "train_file_id = train_file\n",
+ "if train_file.endswith(\"jsonl\"):\n",
+ "    print(f\"Training file detected: {train_file}\")\n",
+ "    train_file_id = upload_file(train_file)\n",
+ "test_file_id = test_file\n",
+ "if test_file and test_file.endswith(\"jsonl\"):\n",
+ "    print(f\"test file detected: {test_file}\")\n",
+ "    test_file_id = upload_file(test_file)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's now define the hyperparameters for our run. We will be fine-tuning `o4-mini` with the `medium` reasoning effort. This parameter limits the number of tokens the model uses to reason, and therefore the overall length of its outputs. We train with a moderate compute multiplier and a reasonable number of epochs, prioritizing efficiency and fast iteration. You’ll want to tailor these depending on your budget, desired generalization, and dataset difficulty."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Set the model and other parameters\n",
+ "model = \"o4-mini-2025-04-16\"\n",
+ "suffix = \"medical_01_verifiable_problem_gpt41_grader\"\n",
+ "reasoning_effort = \"medium\"\n",
+ "n_epochs = 5\n",
+ "seed = 42\n",
+ "grader = model_grader_2\n",
+ "response_format = None\n",
+ "compute_multiplier = 1.0\n",
+ "etest_samples = 1\n",
+ "eval_interval = 5"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We are now ready to launch the run!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Launch the RFT job\n",
+ "payload = dict(\n",
+ "    training_file=train_file_id,\n",
+ "    # The fine-tuning API expects the held-out set under `validation_file`\n",
+ "    validation_file=test_file_id,\n",
+ "    model=model,\n",
+ "    suffix=suffix,\n",
+ "    method=dict(\n",
+ "        type=\"reinforcement\",\n",
+ "        reinforcement=dict(\n",
+ "            grader=grader,\n",
+ "            response_format=response_format,\n",
+ "            hyperparameters=dict(\n",
+ "                compute_multiplier=compute_multiplier,\n",
+ "                # The API hyperparameter is named `eval_samples`\n",
+ "                eval_samples=etest_samples,\n",
+ "                eval_interval=eval_interval,\n",
+ "                n_epochs=n_epochs,\n",
+ "                reasoning_effort=reasoning_effort,\n",
+ "            )\n",
+ "        )\n",
+ "    ),\n",
+ "    seed=seed\n",
+ ")\n",
+ "\n",
+ "try:\n",
+ "    response = requests.post(\n",
+ "        \"https://api.openai.com/v1/fine_tuning/jobs\",\n",
+ "        json=payload,\n",
+ "        headers=HEADERS,\n",
+ "    )\n",
+ "    response.raise_for_status()\n",
+ "    job_id = response.json().get(\"id\")\n",
+ "    if job_id:\n",
+ "        print(\"Training job created with ID:\", job_id)\n",
+ "        print(\n",
+ "            f\"View the job details at: https://platform.openai.com/finetune/{job_id}\")\n",
+ "    else:\n",
+ "        print(\"Failed to retrieve job ID from response.\")\n",
+ "except requests.exceptions.RequestException as e:\n",
+ "    print(f\"An error occurred while creating the training job: {e}\")\n",
+ "    # e.response is only set when the server actually replied\n",
+ "    if e.response is not None:\n",
+ "        print(f\"Response: {e.response.text}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "On the [dashboard](https://platform.openai.com/finetune/) you can observe the reward plots, which let you watch overall performance improve across steps, while the per-grader charts break down specific components in the case of a *multi_grader*. Reasoning token usage trends (often decreasing as the model gets more confident) and step duration metrics give insight into efficiency. Grader latency and error count plots help ensure your grader stays performant and bug-free during the run.\n",
+ "\n",
+ "Here is a snapshot of our training curves, where the green and orange curves are for the training set, while the blue and red curves are for the test subset:\n",
+ "\n",
+ "\n",
+ "\n",
+ "During training, evaluation runs on the test set are logged directly to the [Evaluation API](https://platform.openai.com/evaluations?tab=runs). You can head there to track how your samples perform and get a sense of how predictions evolve over time.\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## **6. Using Your Fine-Tuned Model**\n",
+ "\n",
+ "When training completes, you can call your new model by its `model_id` and benchmark its improvements. Expect sharper predictions! \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# To retrieve information about a fine-tuning job (including the fine-tuned model id), use the job_id:\n",
+ "response = requests.get(\n",
+ "    f\"https://api.openai.com/v1/fine_tuning/jobs/{job_id}\",\n",
+ "    headers=HEADERS,\n",
+ ")\n",
+ "if response.ok:\n",
+ "    data = response.json()\n",
+ "    # The fine-tuned model id is only available once the job has succeeded\n",
+ "    if data.get(\"status\") == \"succeeded\":\n",
+ "        fine_tuned_model_id = data.get(\"fine_tuned_model\")\n",
+ "    else:\n",
+ "        fine_tuned_model_id = None\n",
+ "else:\n",
+ "    raise Exception(f\"Request failed: {response.status_code} - {response.text}\")\n",
+ "print(\"Fine-tuned model id:\", fine_tuned_model_id)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Model's prediction scores\n",
+ "\n",
+ "Let's compute the scores of our base and fine-tuned models for comparison."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Generating predictions (run 1): 0%| | 0/100 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Generating predictions (run 1): 100%|██████████| 100/100 [02:27<00:00, 1.47s/it]\n",
+ "Generating predictions (run 2): 100%|██████████| 100/100 [02:28<00:00, 1.49s/it]\n",
+ "Generating predictions (run 3): 100%|██████████| 100/100 [02:13<00:00, 1.33s/it]\n",
+ "Grading predictions: 100%|██████████| 100/100 [00:23<00:00, 4.30it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.7207700000000001}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:29<00:00, 3.43it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.7125700000000001}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:22<00:00, 4.39it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.7239800000000003}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "from functools import partial\n",
+ "model_name = fine_tuned_model_id\n",
+ "reasoning_effort = \"medium\"\n",
+ "prompt_type = \"simple\"\n",
+ "subset = \"val\"\n",
+ "grader_func = partial(python_model_grader, model_grader=model_grader_2)\n",
+ "grader_func_name = \"python_model_grader_gpt41_score_model_2\"\n",
+ "num_runs = 3\n",
+ "\n",
+ "results_ft_model_grader_2 = generate_model_predictions(\n",
+ " subset=subset,\n",
+ " prompt_type=prompt_type,\n",
+ " model_name=model_name,\n",
+ " reasoning_effort=reasoning_effort,\n",
+ " n_runs=num_runs\n",
+ ")\n",
+ "run_prediction_evaluation(\n",
+ " model_name=model_name, \n",
+ " reasoning_effort=reasoning_effort, \n",
+ " prompt_type=prompt_type, \n",
+ " subset=subset,\n",
+ " grader_func=grader_func, \n",
+ " num_runs=num_runs\n",
+ ")\n",
+ "predictions_ftmodel_medium_simple_prompt_model_grader_2, metrics_ftmodel_medium_simple_prompt_model_grader_2 = load_predictions(\n",
+ " model_name=model_name,\n",
+ " reasoning_effort=reasoning_effort,\n",
+ " prompt_type=prompt_type,\n",
+ " subset=subset,\n",
+ " grader_func_name=grader_func_name,\n",
+ " num_runs=num_runs\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Results for run 1 already exist at data/rft/predictions/val_simple_o4-mini_medium_predictions_run1.json. Loading results.\n",
+ "Results for run 2 already exist at data/rft/predictions/val_simple_o4-mini_medium_predictions_run2.json. Loading results.\n",
+ "Results for run 3 already exist at data/rft/predictions/val_simple_o4-mini_medium_predictions_run3.json. Loading results.\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:21<00:00, 4.57it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.6749300000000003}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:20<00:00, 4.96it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.6755199999999999}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:24<00:00, 4.16it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.64916}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "model_name = \"o4-mini\"\n",
+ "reasoning_effort = \"medium\"\n",
+ "prompt_type = \"simple\"\n",
+ "subset = \"val\"\n",
+ "grader_func = partial(python_model_grader, model_grader=model_grader_2)\n",
+ "grader_func_name = \"python_model_grader_gpt41_score_model_2\"\n",
+ "num_runs = 3\n",
+ "\n",
+ "results_o4mini_model_grader_2 = generate_model_predictions(\n",
+ " subset=subset,\n",
+ " prompt_type=prompt_type,\n",
+ " model_name=model_name,\n",
+ " reasoning_effort=reasoning_effort,\n",
+ " n_runs=num_runs\n",
+ ")\n",
+ "run_prediction_evaluation(\n",
+ " model_name=model_name, \n",
+ " reasoning_effort=reasoning_effort, \n",
+ " prompt_type=prompt_type, \n",
+ " subset=subset,\n",
+ " grader_func=grader_func, \n",
+ " num_runs=num_runs\n",
+ ")\n",
+ "predictions_o4mini_medium_simple_prompt_model_grader_2, metrics_o4mini_medium_simple_prompt_model_grader_2 = load_predictions(\n",
+ " model_name=model_name,\n",
+ " reasoning_effort=reasoning_effort,\n",
+ " prompt_type=prompt_type,\n",
+ " subset=subset,\n",
+ " grader_func_name=grader_func_name,\n",
+ " num_runs=num_runs\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Results for run 1 already exist at data/rft/predictions/val_simple_o3_medium_predictions_run1.json. Loading results.\n",
+ "Results for run 2 already exist at data/rft/predictions/val_simple_o3_medium_predictions_run2.json. Loading results.\n",
+ "Results for run 3 already exist at data/rft/predictions/val_simple_o3_medium_predictions_run3.json. Loading results.\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:32<00:00, 3.10it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.6493800000000001}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:20<00:00, 4.89it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.6722}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Grading predictions: 100%|██████████| 100/100 [00:20<00:00, 4.80it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'total_samples': 100, 'accuracy': 0.7137200000000001}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "model_name = \"o3\"\n",
+ "reasoning_effort = \"medium\"\n",
+ "prompt_type = \"simple\"\n",
+ "subset = \"val\"\n",
+ "grader_func = partial(python_model_grader, model_grader=model_grader_2)\n",
+ "grader_func_name = \"python_model_grader_gpt41_score_model_2\"\n",
+ "num_runs = 3\n",
+ "\n",
+ "results_o3_model_grader_2 = generate_model_predictions(\n",
+ " subset=subset,\n",
+ " prompt_type=prompt_type,\n",
+ " model_name=model_name,\n",
+ " reasoning_effort=reasoning_effort,\n",
+ " n_runs=num_runs\n",
+ ")\n",
+ "run_prediction_evaluation(\n",
+ " model_name=model_name, \n",
+ " reasoning_effort=reasoning_effort, \n",
+ " prompt_type=prompt_type, \n",
+ " subset=subset,\n",
+ " grader_func=grader_func, \n",
+ " num_runs=num_runs\n",
+ ")\n",
+ "predictions_o3_medium_simple_prompt_model_grader_2, metrics_o3_medium_simple_prompt_model_grader_2 = load_predictions(\n",
+ " model_name=model_name,\n",
+ " reasoning_effort=reasoning_effort,\n",
+ " prompt_type=prompt_type,\n",
+ " subset=subset,\n",
+ " grader_func_name=grader_func_name,\n",
+ " num_runs=num_runs\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now visualize them!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "avg_metrics_o4mini_medium_simple_prompt_model_grader_2, std_metrics_o4mini_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_o4mini_medium_simple_prompt_model_grader_2)\n",
+ "avg_metrics_o3_medium_simple_prompt_model_grader_2, std_metrics_o3_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_o3_medium_simple_prompt_model_grader_2)\n",
+ "avg_metrics_ftmodel_medium_simple_prompt_model_grader_2, std_metrics_ftmodel_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_ftmodel_medium_simple_prompt_model_grader_2)\n",
+ "model_metrics_avg = {\n",
+ " \"o4-mini-medium-simple-prompt\": avg_metrics_o4mini_medium_simple_prompt_model_grader_2,\n",
+ " \"o3-medium-simple-prompt\": avg_metrics_o3_medium_simple_prompt_model_grader_2,\n",
+ " \"ftmodel-medium-simple-prompt\": avg_metrics_ftmodel_medium_simple_prompt_model_grader_2\n",
+ "}\n",
+ "model_metrics_std = {\n",
+ " \"o4-mini-medium-simple-prompt\": std_metrics_o4mini_medium_simple_prompt_model_grader_2,\n",
+ " \"o3-medium-simple-prompt\": std_metrics_o3_medium_simple_prompt_model_grader_2,\n",
+ " \"ftmodel-medium-simple-prompt\": std_metrics_ftmodel_medium_simple_prompt_model_grader_2\n",
+ "}\n",
+ "plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title=\"Model Grader 2 Accuracy\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Total mistakes: 80\n",
+ "\n",
+ "[Sample 5]\n",
+ " Model prediction: carotid duplex ultrasound\n",
+ " Reference answer: carotid doppler\n",
+ " Score: 0.5525\n",
+ "\n",
+ "[Sample 6]\n",
+ " Model prediction: under fixation due to insufficient fixation time\n",
+ " Reference answer: incomplete fixation\n",
+ " Score: 0.5037037037037037\n",
+ "\n",
+ "[Sample 7]\n",
+ " Model prediction: acute rheumatic fever due to group a streptococcal pharyngitis mediated by type ii hypersensitivity\n",
+ " Reference answer: acute rheumatic fever\n",
+ " Score: 0.85\n",
+ "\n",
+ "[Sample 8]\n",
+ " Model prediction: exposure (open) method of burn treatment\n",
+ " Reference answer: heterograft application with sutures to secure it in place and daily washes, but no dressing\n",
+ " Score: 0.3031007751937985\n",
+ "\n",
+ "[Sample 9]\n",
+ " Model prediction: beta-lactamase production leading to enzymatic inactivation of ampicillin\n",
+ " Reference answer: production of beta-lactamase enzyme\n",
+ " Score: 0.7555555555555555\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Print mistakes where the model did not get the correct answer (score < 1.0)\n",
+ "mistakes = [\n",
+ " {\"index\": i, **res}\n",
+ " for i, res in enumerate(predictions_ftmodel_medium_simple_prompt_model_grader_2[0])\n",
+ " if res[\"score\"] < 1.0\n",
+ "]\n",
+ "\n",
+ "print(f\"\\nTotal mistakes: {len(mistakes)}\")\n",
+ "for m in mistakes[5:10]:\n",
+ " print(f\"\\n[Sample {m['index']}]\")\n",
+ " print(f\" Model prediction: {m['model_prediction']}\")\n",
+ " print(f\" Reference answer: {m['reference_answer']}\")\n",
+ " print(f\" Score: {m['score']}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We see about a 5-point boost in accuracy after fine-tuning. Looking at the first few errors, the grader tends to harshly penalize answers that are close but not clinically identical, such as *carotid duplex ultrasound* vs. *carotid doppler*. It also dings longer answers, even when they’re correct, like *beta-lactamase production leading to enzymatic inactivation of ampicillin*."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "o4-mini-medium-simple-prompt bin counts: [ 4. 15. 9. 7. 7. 4. 3. 5. 22. 24.]\n",
+ "ftmodel-medium-simple-prompt bin counts: [ 8. 15. 7. 3. 9. 7. 8. 4. 19. 20.]\n",
+ "Max bin count (y-axis): 24.0\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "scores_o4 = [p['score'] for p in predictions_o4mini_medium_simple_prompt_model_grader_2[0]]\n",
+ "scores_ft = [p['score'] for p in predictions_ftmodel_medium_simple_prompt_model_grader_2[0]]\n",
+ "\n",
+ "# Determine common bins for both histograms\n",
+ "all_scores = scores_o4 + scores_ft\n",
+ "bins = np.histogram_bin_edges(all_scores, bins=10)\n",
+ "\n",
+ "# Plot histograms and capture the counts\n",
+ "counts_o4, _, _ = plt.hist(\n",
+ " scores_o4,\n",
+ " bins=bins,\n",
+ " alpha=0.6,\n",
+ " label='o4-mini-medium-simple-prompt'\n",
+ ")\n",
+ "counts_ft, _, _ = plt.hist(\n",
+ " scores_ft,\n",
+ " bins=bins,\n",
+ " alpha=0.6,\n",
+ " label='ftmodel-medium-simple-prompt'\n",
+ ")\n",
+ "\n",
+ "plt.title(\"Model Grader 2 Score Distribution by Model\")\n",
+ "plt.xlabel(\"Score\")\n",
+ "plt.ylabel(\"Count\")\n",
+ "plt.ylim(top=25)\n",
+ "plt.legend()\n",
+ "\n",
+ "# Print the bin counts\n",
+ "print(\"o4-mini-medium-simple-prompt bin counts:\", counts_o4)\n",
+ "print(\"ftmodel-medium-simple-prompt bin counts:\", counts_ft)\n",
+ "print(\"Max bin count (y-axis):\", max(max(counts_o4), max(counts_ft)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking at the distribution of scores, we observe that RFT helped shift the model’s predictions out of the mid-to-low score zone (0.4–0.5) and into the mid-to-high range (0.5–0.6). Since the grader emphasizes clinical similarity over lexical match, this shift reflects stronger medical reasoning, not just better phrasing, according to our *expert* grader. As observed in the 0.9–1.0 range, some verbosity crept in despite mitigations, slightly lowering scores throughout, though it often reflected more complete, semantically aligned answers. A future grader pass could better account for these cases.\n",
+ "\n",
+ "Note that because the earlier `combined_grader` was designed to reward lexical correctness, its accuracy didnʼt improve much, which is expected. That gap reinforces why validating your model grader is critical, and why you should monitor for reward-hacking. In our case, we used `o3` to spot-check grading behavior, but domain expert review is essential."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Model's reasoning\n",
+ "\n",
+ "Another important part of analyzing the fine-tuned model is reviewing its reasoning summaries. The model may provide key information throughout these summaries, and exploring them to understand where the model fails can drive updates to the model's and the grader's system prompts. Below, we show examples of the chain-of-thought summaries the model produced while answering a question:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 118,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean reasoning_tokens_used o4-mini: 424\n",
+ "Mean reasoning_tokens_used o3: 353\n",
+ "Mean reasoning_tokens_used ftmodel: 1820\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Collect the per-run prediction lists for each model\n",
+ "predictions_by_model = {\n",
+ "    \"o4-mini\": predictions_o4mini_medium_simple_prompt_model_grader_2,\n",
+ "    \"o3\": predictions_o3_medium_simple_prompt_model_grader_2,\n",
+ "    \"ftmodel\": predictions_ftmodel_medium_simple_prompt_model_grader_2,\n",
+ "}\n",
+ "\n",
+ "# Use distinct loop names so we don't shadow the dict or the global model_name\n",
+ "for name, runs in predictions_by_model.items():\n",
+ "    # Flatten the list of runs into a single list of prediction dicts\n",
+ "    all_preds = [item for run in runs for item in run]\n",
+ "    reasoning_tokens = [p['reasoning_tokens_used'] for p in all_preds if 'reasoning_tokens_used' in p]\n",
+ "    mean_reasoning_tokens = np.mean(reasoning_tokens)\n",
+ "    print(f\"Mean reasoning_tokens_used {name}: {mean_reasoning_tokens:.0f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The fine-tuned model spends more reasoning tokens thinking through the question: about 1,820 on average, versus 424 for base `o4-mini` and 353 for `o3`. Let's look at an example using the reasoning summaries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/markdown": [
+ "**Classifying staging type**\n",
+ "\n",
+ "The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is \"diagnosis.\" Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that."
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from IPython.display import Markdown, display\n",
+ "markdown_text = results_o4mini_model_grader_2[5][\"summaries\"]\n",
+ "display(Markdown(markdown_text))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/markdown": [
+ "**Clarifying T staging for cancers**\n",
+ "\n",
+ "I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size.\n",
+ "**Determining T and N staging**\n",
+ "\n",
+ "I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.\n",
+ "\n",
+ "Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm."
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "markdown_text = results_ft_model_grader_2[5][\"summaries\"]\n",
+ "display(Markdown(markdown_text))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Base `o4-mini`'s reasoning gives a quick answer but doesn’t explain how it got there. It mentions the tumor size but doesn’t walk through the actual TNM rules, and it seems unsure about the result. The fine-tuned model, on the other hand, is more thorough: it breaks down the T and N staging step by step and explains why each part applies. It appears to have learned to work through the case description in more detail."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### To push the scores further\n",
+ "Both the baseline `o3` and our fine-tuned `o4-mini` sometimes scored zero on the same samples, a red flag that the reference labels may be wrong. Before adding more compute, invest in data quality: have a domain expert relabel the noisy slice, analyze the model's reasoning, then tighten the grader prompt. Clean, trusted data and methodical updates almost always buy more accuracy than extra epochs."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## **Conclusion**\n",
+ "\n",
+ "Weʼve looked at how to design graders that give `o4-mini` the kind of detailed feedback it needs during RFT. That signal is what helps the model actually learn and improve beyond the baseline. Model graders can be incredibly powerful for this, but only if theyʼre designed carefully. A sloppy grader or sloppy data can send the wrong signals and steer the model in the wrong direction.\n",
+ "\n",
+ "You're now ready to apply reinforcement fine-tuning on your own models using the OpenAI API. Weʼre excited to see how you push the boundaries of reasoning and tool use with custom graders and smarter model behavior!\n",
+ "\n",
+ "For troubleshooting or next steps, refer to the [OpenAI fine-tuning documentation](https://platform.openai.com/docs/guides/fine-tuning)."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "jupyter-env",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/images/rft_dashboard_modelgrader2.png b/images/rft_dashboard_modelgrader2.png
new file mode 100644
index 0000000000..731c38c9af
Binary files /dev/null and b/images/rft_dashboard_modelgrader2.png differ
diff --git a/images/rft_hacking.png b/images/rft_hacking.png
new file mode 100644
index 0000000000..2c56c089bc
Binary files /dev/null and b/images/rft_hacking.png differ
diff --git a/images/rft_string_grader.png b/images/rft_string_grader.png
new file mode 100644
index 0000000000..0352efe15c
Binary files /dev/null and b/images/rft_string_grader.png differ
diff --git a/registry.yaml b/registry.yaml
index ce8734671b..3026254ae6 100644
--- a/registry.yaml
+++ b/registry.yaml
@@ -4,6 +4,16 @@
# should build pages for, and indicates metadata such as tags, creation date and
# authors for each page.
+- title: Exploring Model Graders for Reinforcement Fine-Tuning
+ path: examples/Reinforcement_Fine_Tuning.ipynb
+ date: 2025-05-23
+ authors:
+ - theophile-openai
+ tags:
+ - reinforcement-learning
+ - fine-tuning
+ - reinforcement-learning-graders
+
- title: Reinforcement Fine-tuning with the OpenAI API
path: examples/fine-tuned_qa/reinforcement_finetuning_healthbench.ipynb
date: 2025-05-21