Rendering updates for the RFT with model grader cookbook (#1862)

theophile-oai · web-flow · commit 125f9cd81853 · 2025-05-27T13:58:34.000-07:00
diff --git a/examples/Reinforcement_Fine_Tuning.ipynb b/examples/Reinforcement_Fine_Tuning.ipynb
@@ -689,7 +689,7 @@
     "\n",
     "We can visualize the full score distribution on the training set.\n",
     "\n",
-    "> **Note:** : In practice, analyzing model errors at scale often involves a mix of manual review and automated methods, such as tagging failure types or clustering predictions by score and content. That workflow is beyond the scope of this guide, but it's a valuable next step once you've identified broad patterns."
+    "> Note: In practice, analyzing model errors at scale often involves a mix of manual review and automated methods, such as tagging failure types or clustering predictions by score and content. That workflow is beyond the scope of this guide, but it's a valuable next step once you've identified broad patterns."
    ]
   },
   {
@@ -1968,7 +1968,7 @@
    "source": [
     "Looking at the distruibution of scores, we observe that RFT helped shift the model’s predictions out of the mid-to-low score zone (0.4–0.5) and into the mid-to-high range (0.5–0.6). Since the grader emphasizes clinical similarity over lexical match, this shift reflects stronger medical reasoning-not just better phrasing-according to our *expert* grader. As observed in the 0.9-1.0 range, some verbosity crept in despite mitigations and slightly lowering scores throughout, though it often reflected more complete, semantically aligned answers. A future grader pass could better account for these cases.\n",
     "\n",
-    "Note, because the earlier `combined_grader` was designed to reward lexical correctness, its accuracy didnʼt improve much, which is expected. That gap reinforces why validating your model grader is critical, and why you should monitor for reward-hacking. In our case, we used `o3` to spot-check grading behavior, but domain expert review is essential. "
+    "Note that, because the earlier `combined_grader` was designed to reward lexical correctness, its accuracy didnʼt improve much, which is expected. That gap reinforces why validating your model grader is critical, and why you should monitor for reward-hacking. In our case, we used `o3` to spot-check grading behavior, but domain expert review is essential. "
    ]
   },
   {
@@ -2019,22 +2019,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/markdown": [
-       "**Classifying staging type**\n",
-       "\n",
-       "The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is \"diagnosis.\" Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that."
-      ],
-      "text/plain": [
-       "<IPython.core.display.Markdown object>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Classifying staging type\n",
+      "\n",
+      "The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is \"diagnosis.\" Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that.\n"
+     ]
     }
    ],
    "source": [
@@ -2045,27 +2040,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/markdown": [
-       "**Clarifying T staging for cancers**\n",
-       "\n",
-       "I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size.\n",
-       "**Determining T and N staging**\n",
-       "\n",
-       "I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.\n",
-       "\n",
-       "Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm."
-      ],
-      "text/plain": [
-       "<IPython.core.display.Markdown object>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Clarifying T staging for cancers\n",
+      "\n",
+      "I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size. Determining T and N staging\n",
+      "\n",
+      "I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.\n",
+      "\n",
+      "Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm.\n"
+     ]
     }
    ],
    "source": [
diff --git a/registry.yaml b/registry.yaml
@@ -19,7 +19,7 @@
   path: examples/Reinforcement_Fine_Tuning.ipynb
   date: 2025-05-23
   authors:
-    - theophile-openai
+    - theophile-oai
   tags:
     - reinforcement-learning
     - fine-tuning
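
The output hunks above replace rich `display_data` Markdown output with plain `stdout` text, dropping the `**bold**` heading markers along the way. The notebook's source cells are not part of this diff, so how that is done there is unknown. As a minimal sketch, assuming the summaries arrive as Markdown strings, the helper below (the hypothetical `summary_to_plain_text`, not a function from the notebook) shows one way to produce that kind of plain-text output:

```python
import re


def summary_to_plain_text(markdown_text: str) -> str:
    """Remove **bold** markers so the text reads cleanly when printed.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    # Non-greedy match keeps each bold span separate on lines with several.
    return re.sub(r"\*\*(.+?)\*\*", r"\1", markdown_text)


# Example input modeled on the first output cell in the diff (shortened).
summary = "**Classifying staging type**\n\nThe user provided a clinical scenario."

# print() writes to stdout, which a notebook stores as a "stream" output
# rather than a "display_data" Markdown object.
print(summary_to_plain_text(summary))
```

Printing plain text keeps the stored output readable in renderers that do not process `text/markdown`, at the cost of losing heading emphasis.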