diff --git a/examples/partners/eval_driven_system_design/receipt_inspection.ipynb b/examples/partners/eval_driven_system_design/receipt_inspection.ipynb
index 6a8753ba92..b9688bf8c0 100644
--- a/examples/partners/eval_driven_system_design/receipt_inspection.ipynb
+++ b/examples/partners/eval_driven_system_design/receipt_inspection.ipynb
@@ -112,6 +112,15 @@
  "source": [
  "## Project Lifecycle\n",
  "\n",
+ "Not every project will proceed in the same way, but projects generally have some \n",
+ "important components in common.\n",
+ "\n",
+ "![Project Lifecycle](../../../images/partner_project_lifecycle.png)\n",
+ "\n",
+ "The solid arrows show the primary progressions or steps, while the dotted line \n",
+ "represents the ongoing nature of problem understanding - uncovering more about\n",
+ "the customer domain will influence every step of the process. We will examine \n",
+ "several of these iterative cycles of refinement in detail below. \n",
  "Not every project will proceed in the same way, but projects generally have some common\n",
  "important components.\n",
  "\n",
@@ -133,6 +142,11 @@
  "It's very rare that a real-world project will start with all the data necessary to get\n",
  "to a satisfactory solution, much less to establish confidence.\n",
  "\n",
+ "In our case, we're going to assume that we have a decent sample of system *inputs*, \n",
+ "in the form of receipt images, but start without any fully annotated data. We find \n",
+ "this is a not-unusual situation when automating an existing process. Instead, \n",
+ "we'll walk through the process of building that out as we go along by collaborating with\n",
+ "domain experts, and make our evals progressively more comprehensive.\n",
  "In our case, we're going to assume that we have a decent sample of system *inputs*\n",
  "(here, photographs of receipts), but start without any fully annotated data. We'll walk\n",
  "through the process of incrementally expanding our test and training sets as we go along\n",
@@ -498,6 +512,21 @@
  "### Action Decision\n",
  "\n",
  "Next, we need to close the loop and get to an actual decision based on receipts. This\n",
+ "looks pretty similar, so we'll present the code without comment.\n",
+ "\n",
+ "Ordinarily one would start with the most capable model - `o3`, at this time - for a \n",
+ "first pass, and then once correctness is established experiment with different models\n",
+ "to analyze any tradeoffs for their business impact, and potentially consider whether \n",
+ "they are remediable with iteration. A client may be willing to take a certain accuracy \n",
+ "hit for lower latency or cost, or it may be more effective to change the architecture\n",
+ "to hit cost, latency, and accuracy goals. We'll get into how to make these tradeoffs\n",
+ "explicitly and objectively later on. \n",
+ "\n",
+ "For this cookbook, `o3` might be too good. We'll use `o4-mini` for our first pass, so \n",
+ "that we get a few reasoning errors we can use to illustrate the means of addressing\n",
+ "them when they occur.\n",
+ "\n",
+ "Next, we need to close the loop and get to an actual decision based on receipts. This\n",
  "looks pretty similar, so we'll present the code without comment."
  ]
  },
@@ -887,6 +916,10 @@
  "metadata": {},
  "source": [
  "After you run that eval you'll be able to view it in the UI, and should see something\n",
+ "like the below. \n",
+ "\n",
+ "(Note, if you have a Zero-Data-Retention agreement, this data is not stored\n",
+ "by OpenAI, so will not be available in this interface.)\n",
  "like:\n",
  "\n",
  "![Summary UI](../../../images/partner_summary_ui.png)\n",
@@ -1617,6 +1650,7 @@
  "ARE NOT TRAVEL-RELATED, THEN IT MUST BE AUDITED.\n",
  "```\n",
  "\n",
+ "4. We added three examples, JSON input/output pairs wrapped in XML tags.\n",
  "3. We added three examples, JSON input/output pairs wrapped in XML tags.\n",
  "\n",
  "With our prompt revisions, we'll regenerate the data to evaluate and re-run the same\n",
diff --git a/images/partner_development_flywheel.png b/images/partner_development_flywheel.png
index 2249ee4a64..5ea3c017de 100644
Binary files a/images/partner_development_flywheel.png and b/images/partner_development_flywheel.png differ
diff --git a/images/partner_model_improvement_waterfall.png b/images/partner_model_improvement_waterfall.png
index ddd126b9f2..0be43831f0 100644
Binary files a/images/partner_model_improvement_waterfall.png and b/images/partner_model_improvement_waterfall.png differ
diff --git a/images/partner_process_flowchart.png b/images/partner_process_flowchart.png
index c6090e7092..f6bc53d5af 100644
Binary files a/images/partner_process_flowchart.png and b/images/partner_process_flowchart.png differ
diff --git a/images/partner_project_lifecycle.png b/images/partner_project_lifecycle.png
new file mode 100644
index 0000000000..2ea16d93b1
Binary files /dev/null and b/images/partner_project_lifecycle.png differ
diff --git a/registry.yaml b/registry.yaml
index 227d2b3b85..8e7102004b 100644
--- a/registry.yaml
+++ b/registry.yaml
@@ -9,8 +9,13 @@
  date: 2025-06-01
  authors:
  - shikhar-cyber
+ - moredatarequired
+ - tooluser
+ - eddiesiegel
  tags:
  - evals
+ - API Flywheel
+ - completions
  - responses
  - functions
  - tracing