Skip to content

Rendering updates for the RFT with model grader cookbook #1862

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 27, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 22 additions & 33 deletions examples/Reinforcement_Fine_Tuning.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -689,7 +689,7 @@
"\n",
"We can visualize the full score distribution on the training set.\n",
"\n",
"> **Note:** : In practice, analyzing model errors at scale often involves a mix of manual review and automated methods-like tagging failure types or clustering predictions by score and content. That workflow is beyond the scope of this guide, but it's a valuable next step once you've identified broad patterns."
"> Note: In practice, analyzing model errors at scale often involves a mix of manual review and automated methods-like tagging failure types or clustering predictions by score and content. That workflow is beyond the scope of this guide, but it's a valuable next step once you've identified broad patterns."
]
},
{
Expand Down Expand Up @@ -1968,7 +1968,7 @@
"source": [
"Looking at the distruibution of scores, we observe that RFT helped shift the model’s predictions out of the mid-to-low score zone (0.4–0.5) and into the mid-to-high range (0.5–0.6). Since the grader emphasizes clinical similarity over lexical match, this shift reflects stronger medical reasoning-not just better phrasing-according to our *expert* grader. As observed in the 0.9-1.0 range, some verbosity crept in despite mitigations and slightly lowering scores throughout, though it often reflected more complete, semantically aligned answers. A future grader pass could better account for these cases.\n",
"\n",
"Note, because the earlier `combined_grader` was designed to reward lexical correctness, its accuracy didnʼt improve much-which is expected. That gap reinforces why validating your model grader is critical, and why you should monitor for reward-hacking. In our case, we used `o3` to spot-check grading behavior, but domain expert review is essential. "
"Note that, because the earlier `combined_grader` was designed to reward lexical correctness, its accuracy didnʼt improve much-which is expected. That gap reinforces why validating your model grader is critical, and why you should monitor for reward-hacking. In our case, we used `o3` to spot-check grading behavior, but domain expert review is essential. "
]
},
{
Expand Down Expand Up @@ -2019,22 +2019,17 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Classifying staging type**\n",
"\n",
"The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is \"diagnosis.\" Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
"name": "stdout",
"output_type": "stream",
"text": [
"Classifying staging type\n",
"\n",
"The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is \"diagnosis.\" Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that.\n"
]
}
],
"source": [
Expand All @@ -2045,27 +2040,21 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"**Clarifying T staging for cancers**\n",
"\n",
"I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size.\n",
"**Determining T and N staging**\n",
"\n",
"I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.\n",
"\n",
"Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
"name": "stdout",
"output_type": "stream",
"text": [
"Clarifying T staging for cancers\n",
"\n",
"I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size. Determining T and N staging\n",
"\n",
"I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.\n",
"\n",
"Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm.\n"
]
}
],
"source": [
Expand Down
2 changes: 1 addition & 1 deletion registry.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
path: examples/Reinforcement_Fine_Tuning.ipynb
date: 2025-05-23
authors:
- theophile-openai
- theophile-oai
tags:
- reinforcement-learning
- fine-tuning
Expand Down