|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "85b66af9", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# Building an **AI Research Assistant** with the OpenAI Agents SDK\n", |
| 9 | + "\n", |
| 10 | + "This notebook provides a reference pattern for implementing a multi‑agent AI Research Assistant that can plan, search, curate, and draft high‑quality reports with citations.\n", |
| 11 | + "\n", |
| 12 | + "While the Deep Research feature is available in ChatGPT, individuals and companies may want to implement their own API-based solution for more fine-grained control over the output.\n", |
| 13 | + "\n", |
| 14 | + "With support for agents and built-in tools such as Code Interpreter, Web Search, and File Search, the Responses API makes building your own Research Assistant fast and easy." |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "cell_type": "markdown", |
| 19 | + "id": "0dcd3942", |
| 20 | + "metadata": {}, |
| 21 | + "source": [ |
| 22 | + "## Table of Contents\n", |
| 23 | + "1. [Overview](#overview)\n", |
| 24 | + "2. [Solution Workflow](#workflow)\n", |
| 25 | + "3. [High‑Level Architecture](#architecture)\n", |
| 26 | + "4. [Pre-requisites](#pre-requisites)\n", |
| | + "5. [Agents (Pseudo Code)](#agents)\n", |
| 27 | + " * Research Planning Agent\n", |
| 28 | + " * Web Search Agent\n", |
| 29 | + " * Knowledge Assistant Agent\n", |
| 30 | + " * Report Creation Agent\n", |
| 31 | + " * Data Analysis Agent (optional)\n", |
| 32 | + " * Image‑Gen Agent (optional)\n", |
| 33 | + "6. [Guardrails & Best Practices](#best-practices)\n", |
| 34 | + "7. [Risks & Mitigation](#risks)" |
| 35 | + ] |
| 36 | + }, |
| 37 | + { |
| 38 | + "cell_type": "markdown", |
| 39 | + "id": "a32e358e", |
| 40 | + "metadata": {}, |
| 41 | + "source": [ |
| 42 | + "### 1 — Overview <a id='overview'></a>\n", |
| 43 | + "The AI Research Assistant helps drive better research quality and faster turnaround for knowledge content.\n", |
| 44 | + "\n", |
| 45 | + "1. **Performs autonomous Internet research** to gather the most recent sources.\n", |
| 46 | + "2. **Incorporates internal data sources** such as a company's proprietary knowledge bases.\n", |
| 47 | + "3. **Reduces analyst effort from days to minutes** by automating search, curation and first‑draft writing.\n", |
| 48 | + "4. **Produces draft reports with citations** and built‑in hallucination detection." |
| 49 | + ] |
| 50 | + }, |
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "id": "33cb6ce3", |
| 54 | + "metadata": {}, |
| 55 | + "source": [ |
| 56 | + "### 2 — Solution Workflow <a id='workflow'></a>\n", |
| 57 | + "The typical workflow consists of five orchestrated steps: \n", |
| 58 | + "\n", |
| 59 | + "| Step | Purpose | Model |\n", |
| 60 | + "|------|---------|-------|\n", |
| 61 | + "| **Query Expansion** | Draft multi‑facet prompts / hypotheses | `gpt‑4o` |\n", |
| 62 | + "| **Search‑Term Generation** | Expand/clean user query into rich keyword list | `gpt‑4o` |\n", |
| 63 | + "| **Conduct Research** | Run web & internal searches, rank & summarise results | `gpt‑4o` + tools |\n", |
| 64 | + "| **Draft Report** | Produce first narrative with reasoning & inline citations | `o1` / `gpt‑4o` |\n", |
| 65 | + "| **Report Expansion** | Polish formatting, add charts / images / appendix | `gpt‑4o` + tools |" |
| 66 | + ] |
| 67 | + }, |
| 68 | + { |
| 69 | + "cell_type": "markdown", |
| 70 | + "id": "dcb4e6dc", |
| 71 | + "metadata": {}, |
| 72 | + "source": [ |
| 73 | + "### 3 — High‑Level Architecture <a id='architecture'></a>\n", |
| 74 | + "The following diagram groups agents and tools:\n", |
| 75 | + "\n", |
| 76 | + "* **Research Planning Agent** – interprets the user request and produces a research plan/agenda.\n", |
| 77 | + "* **Knowledge Assistant Agent** – orchestrates parallel web & file searches via built‑in tools, curates short‑term memory.\n", |
| 78 | + "* **Web Search Agent(s)** – perform Internet queries, deduplicate, rank and summarise pages.\n", |
| 79 | + "* **Report Creation Agent** – consumes curated corpus and drafts the structured report.\n", |
| 80 | + "* **(Optional) Data Analysis Agent** – executes code for numeric/CSV analyses via the Code Interpreter tool.\n", |
| 81 | + "* **(Optional) Image‑Gen Agent** – generates illustrative figures.\n", |
| 82 | + "\n", |
| 83 | + "Input/output guardrails wrap user prompts and final content for policy, safety and citation checks." |
| 84 | + ] |
| 85 | + }, |
| 86 | + { |
| 87 | + "cell_type": "markdown", |
| 88 | + "id": "d3464739", |
| 89 | + "metadata": {}, |
| 90 | + "source": [ |
| 91 | + "### 4 — Pre-requisites <a id='pre-requisites'></a>\n", |
| 92 | + "\n", |
| 93 | + "Create and activate a virtual environment.\n", |
| 94 | + "\n", |
| 95 | + "Install the dependencies:" |
| 96 | + ] |
| 97 | + }, |
| 98 | + { |
| 99 | + "cell_type": "code", |
| 100 | + "execution_count": 1, |
| 101 | + "id": "3a16ac1f", |
| 102 | + "metadata": {}, |
| 103 | + "outputs": [ |
| 104 | + { |
| 105 | + "name": "stdout", |
| 106 | + "output_type": "stream", |
| 107 | + "text": [ |
| 108 | + "Note: you may need to restart the kernel to use updated packages.\n" |
| 109 | + ] |
| 110 | + } |
| 111 | + ], |
| 112 | + "source": [ |
| 113 | + "%pip install openai openai-agents --quiet" |
| 114 | + ] |
| 115 | + }, |
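| | + { |
| | + "cell_type": "markdown", |
| | + "id": "e1a2b3c4", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Both the OpenAI client and the Agents SDK read the `OPENAI_API_KEY` environment variable. The cell below is a minimal check that the key is set before any agent is run; set the variable in your shell (or load it from a `.env` file) if the assertion fails." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "e1a2b3c5", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import os\n", |
| | + "\n", |
| | + "# The OpenAI client and the Agents SDK pick up the API key from this\n", |
| | + "# environment variable; set it before running the cells below.\n", |
| | + "assert os.environ.get(\"OPENAI_API_KEY\"), \"Please set the OPENAI_API_KEY environment variable.\"" |
| | + ] |
| | + }, |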
| 116 | + { |
| 117 | + "cell_type": "markdown", |
| 118 | + "id": "69135215", |
| 119 | + "metadata": {}, |
| 120 | + "source": [ |
| 121 | + "### 5 — Agents (Pseudo Code) <a id='agents'></a>\n", |
| 122 | + "Below are skeletal definitions illustrating how each agent’s instructions and tool usage might look." |
| 123 | + ] |
| 124 | + }, |
| 125 | + { |
| 126 | + "cell_type": "markdown", |
| 127 | + "id": "b9f3062e", |
| 128 | + "metadata": {}, |
| 129 | + "source": [ |
| 130 | + "#### Step 1 - Query Expansion" |
| 131 | + ] |
| 132 | + }, |
| 133 | + { |
| 134 | + "cell_type": "code", |
| 135 | + "execution_count": null, |
| 136 | + "id": "b576089c", |
| 137 | + "metadata": {}, |
| 138 | + "outputs": [ |
| 139 | + { |
| 140 | + "name": "stdout", |
| 141 | + "output_type": "stream", |
| 142 | + "text": [ |
| 143 | + "Draft a comprehensive research report analyzing the evolution and impact of artificial intelligence (AI) over the past five years. This report should investigate key trends that have emerged during this period, including advancements in machine learning models like GPT and BERT, the rise of AI in industries such as healthcare, finance, and autonomous vehicles, and the ethical considerations surrounding AI development and implementation. Delve into how these trends have influenced technological growth, business strategies, and regulatory measures globally. Evaluate the societal and economic implications of these advancements and provide insights into future directions AI might take. Use a variety of sources, including scholarly articles, industry reports, and expert interviews, to support your analysis and conclusions.\n" |
| 144 | + ] |
| 145 | + } |
| 146 | + ], |
| 147 | + "source": [ |
| 148 | + "from agents import Agent, Runner\n", |
| 149 | + "\n", |
| 150 | + "query_expansion_agent = Agent(\n", |
| 151 | + " name=\"Query Expansion Agent\",\n", |
| 152 | + " instructions=\"\"\"You are a helpful agent who is given a research prompt from the user as input. \n", |
| 153 | + " Your task is to expand the prompt into a more complete and actionable research prompt. Do not write the research \n", |
| 154 | + " paper, just improve the prompt in about one paragraph. Only respond with the expanded prompt no qualifiers.\"\"\",\n", |
| 155 | + " tools=[],\n", |
| 156 | + " model=\"gpt-4o\", \n", |
| 157 | + ")\n", |
| 158 | + "\n", |
| 159 | + "result = await Runner.run(query_expansion_agent, \"Draft a research report on the latest trends in personal auto insurance in the US\")\n", |
| 160 | + "\n", |
| 161 | + "expanded_prompt = result.final_output \n", |
| 162 | + "\n", |
| 163 | + "print(expanded_prompt)" |
| 164 | + ] |
| 165 | + }, |
| 166 | + { |
| 167 | + "cell_type": "markdown", |
| 168 | + "id": "6b1b10e7", |
| 169 | + "metadata": {}, |
| 170 | + "source": [ |
| 171 | + "#### Step 2 - Web Search Terms " |
| 172 | + ] |
| 173 | + }, |
| 174 | + { |
| 175 | + "cell_type": "markdown", |
| 176 | + "id": "725969cb", |
| 177 | + "metadata": {}, |
| 178 | + "source": [ |
| 179 | + "Generate the web search terms. You can customize the number of search terms generated to achieve a given level of research depth." |
| 180 | + ] |
| 181 | + }, |
| 182 | + { |
| 183 | + "cell_type": "code", |
| 184 | + "execution_count": 13, |
| 185 | + "id": "d3b4d4af", |
| 186 | + "metadata": {}, |
| 187 | + "outputs": [ |
| 188 | + { |
| 189 | + "name": "stdout", |
| 190 | + "output_type": "stream", |
| 191 | + "text": [ |
| 192 | + "Search_Queries=['Evolution of AI technology 2020-2025', 'Impact of machine learning models GPT and BERT on industries', 'AI advancements in healthcare and finance 2025', 'Ethics and AI development 2025', 'Future directions for artificial intelligence in various sectors']\n", |
| 193 | + "(0, 'Evolution of AI technology 2020-2025')\n", |
| 194 | + "(1, 'Impact of machine learning models GPT and BERT on industries')\n", |
| 195 | + "(2, 'AI advancements in healthcare and finance 2025')\n", |
| 196 | + "(3, 'Ethics and AI development 2025')\n", |
| 197 | + "(4, 'Future directions for artificial intelligence in various sectors')\n" |
| 198 | + ] |
| 199 | + } |
| 200 | + ], |
| 201 | + "source": [ |
| 202 | + "from pydantic import BaseModel\n", |
| 203 | + "\n", |
| 204 | + "class SearchTerms(BaseModel):\n", |
| 205 | + " \"\"\"Structured output model for search-terms suggestions.\"\"\"\n", |
| 206 | + " Search_Queries: list[str]\n", |
| 207 | + "\n", |
| 208 | + "\n", |
| 209 | + "search_terms_agent = Agent(\n", |
| 210 | + " name=\"Search Terms Agent\",\n", |
| 211 | + " instructions=\"\"\"You are a helpful agent assigned a research task. Your job is to provide the top \n", |
| 212 | + " 5 Search Queries relevant to the given topic in this year (2025). The output should be in JSON format.\n", |
| 213 | + "\n", |
| 214 | + " Example format provided below:\n", |
| 215 | + " <START OF EXAMPLE>\n", |
| 216 | + " {\n", |
| 217 | + " \"Search_Queries\": [\n", |
| 218 | + " \"Top ranked auto insurance companies US 2025 by market capitalization\",\n", |
| 219 | + " \"Geico rates and comparison with other auto insurance companies\",\n", |
| 220 | + " \"Insurance premiums of top ranked companies in the US in 2025\", \n", |
| 221 | + " \"Total cost of insuring autos in US 2025\", \n", |
| 222 | + " \"Top customer service feedback for auto insurance in 2025\"\n", |
| 223 | + " ]\n", |
| 224 | + " }\n", |
| 225 | + " </END OF EXAMPLE>\n", |
| 226 | + " \"\"\",\n", |
| 227 | + " tools=[],\n", |
| 228 | + " model=\"gpt-4o\", \n", |
| 229 | + " output_type=SearchTerms,\n", |
| 230 | + ")\n", |
| 231 | + "\n", |
| 232 | + "result = await Runner.run(search_terms_agent, expanded_prompt)\n", |
| 233 | + "\n", |
| 234 | + "search_terms_raw = result.final_output\n", |
| 235 | + "\n", |
| 236 | + "print(search_terms_raw)\n", |
| 237 | + "\n", |
| 238 | + "\n", |
| 239 | + "for query in enumerate(search_terms_raw.Search_Queries):\n", |
| 240 | + " print(f\"{query}\")" |
| 241 | + ] |
| 242 | + }, |
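| | + { |
| | + "cell_type": "markdown", |
| | + "id": "f3c4d5e6", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "#### Step 3 - Conduct Research (sketch)\n", |
| | + "\n", |
| | + "The \"Conduct Research\" step from the workflow table can be handled by an agent that uses the SDK's built-in `WebSearchTool`. The cell below is a minimal, illustrative sketch rather than a prescribed implementation: the agent name, its instructions, and the `research_summaries` list are assumptions made for this example." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "f3c4d5e7", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "from agents import Agent, Runner, WebSearchTool\n", |
| | + "\n", |
| | + "# Minimal sketch of the Web Search Agent described in the architecture section.\n", |
| | + "# It relies on the built-in WebSearchTool; a FileSearchTool could be wired in\n", |
| | + "# the same way when an internal vector store is available.\n", |
| | + "web_search_agent = Agent(\n", |
| | + "    name=\"Web Search Agent\",\n", |
| | + "    instructions=\"\"\"You are a research assistant. Search the web for the given query and return\n", |
| | + "    a concise, well-sourced summary (with URLs) of the most relevant findings.\"\"\",\n", |
| | + "    tools=[WebSearchTool()],\n", |
| | + "    model=\"gpt-4o\",\n", |
| | + ")\n", |
| | + "\n", |
| | + "research_summaries = []\n", |
| | + "for query in search_terms_raw.Search_Queries:\n", |
| | + "    search_result = await Runner.run(web_search_agent, query)\n", |
| | + "    research_summaries.append(search_result.final_output)\n", |
| | + "\n", |
| | + "print(f\"Collected {len(research_summaries)} research summaries.\")" |
| | + ] |
| | + }, |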
| 243 | + { |
| 244 | + "cell_type": "code", |
| 245 | + "execution_count": null, |
| 246 | + "id": "a477b6a8", |
| 247 | + "metadata": {}, |
| 248 | + "outputs": [], |
| 249 | + "source": [ |
| 250 | + "class KnowledgeAssistantAgent:\n", |
| 251 | + " \"\"\"Curates short‑term memory of research snippets.\"\"\"\n", |
| 252 | + " def run(self, web_snippets, file_snippets):\n", |
| 253 | + " corpus = web_snippets + file_snippets\n", |
| 254 | + " # Vector‑embed & cluster (pseudo)\n", |
| 255 | + " # ...\n", |
| 256 | + " # Return pruned, deduplicated corpus\n", |
| 257 | + " return corpus[:50] # top‑N" |
| 258 | + ] |
| 259 | + }, |
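| | + { |
| | + "cell_type": "markdown", |
| | + "id": "a7b8c9d0", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The \"Vector-embed & cluster\" placeholder above could be implemented with the OpenAI embeddings endpoint. The cell below is one possible sketch, not the notebook's prescribed approach: it embeds each snippet with `text-embedding-3-small` and drops near-duplicates above an arbitrary cosine-similarity threshold." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "a7b8c9d1", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import math\n", |
| | + "from openai import OpenAI\n", |
| | + "\n", |
| | + "client = OpenAI()\n", |
| | + "\n", |
| | + "def cosine_similarity(a: list[float], b: list[float]) -> float:\n", |
| | + "    dot = sum(x * y for x, y in zip(a, b))\n", |
| | + "    norm_a = math.sqrt(sum(x * x for x in a))\n", |
| | + "    norm_b = math.sqrt(sum(y * y for y in b))\n", |
| | + "    return dot / (norm_a * norm_b)\n", |
| | + "\n", |
| | + "def deduplicate_snippets(snippets: list[str], threshold: float = 0.9) -> list[str]:\n", |
| | + "    \"\"\"Drop snippets whose embedding is nearly identical to one already kept.\"\"\"\n", |
| | + "    if not snippets:\n", |
| | + "        return []\n", |
| | + "    # A single API call embeds all snippets at once.\n", |
| | + "    response = client.embeddings.create(model=\"text-embedding-3-small\", input=snippets)\n", |
| | + "    vectors = [item.embedding for item in response.data]\n", |
| | + "\n", |
| | + "    kept: list[int] = []\n", |
| | + "    for i, vec in enumerate(vectors):\n", |
| | + "        if not any(cosine_similarity(vec, vectors[j]) > threshold for j in kept):\n", |
| | + "            kept.append(i)\n", |
| | + "    return [snippets[i] for i in kept]" |
| | + ] |
| | + }, |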
| 260 | + { |
| 261 | + "cell_type": "code", |
| 262 | + "execution_count": null, |
| 263 | + "id": "de168943", |
| 264 | + "metadata": {}, |
| 265 | + "outputs": [], |
| 266 | + "source": [ |
| 267 | + "from openai import OpenAI\n", |
| | + "\n", |
| | + "client = OpenAI()\n", |
| | + "\n", |
| | + "class ReportCreationAgent:\n", |
| | + "    \"\"\"Drafts the first complete report with citations.\"\"\"\n", |
| | + "    def run(self, curated_corpus, outline):\n", |
| | + "        # Calls the Responses API directly; this could equally be written with the Agents SDK.\n", |
| | + "        response = client.responses.create(\n", |
| | + "            model=\"gpt-4o\",\n", |
| | + "            instructions=\"Write a research report following the outline. Cite sources in IEEE style.\",\n", |
| | + "            input=str({\"outline\": outline, \"corpus\": curated_corpus}),\n", |
| | + "        )\n", |
| | + "        return response.output_text" |
| 276 | + ] |
| 277 | + }, |
| 278 | + { |
| 279 | + "cell_type": "code", |
| 280 | + "execution_count": null, |
| 281 | + "id": "40c6bbca", |
| 282 | + "metadata": {}, |
| 283 | + "outputs": [], |
| 284 | + "source": [ |
| 285 | + "# --- Orchestration skeleton ---\n", |
| | + "# ResearchPlanningAgent and WebSearchAgent are assumed to follow the same\n", |
| | + "# skeleton pattern as the agent classes defined above.\n", |
| 286 | + "def generate_research_paper(topic: str):\n", |
| 287 | + " plan = ResearchPlanningAgent().run(topic)\n", |
| 288 | + "\n", |
| 289 | + " web_results = WebSearchAgent().run(plan['search_terms'])\n", |
| 290 | + " # TODO: file_results via file_search if internal corpus available\n", |
| 291 | + " file_results = []\n", |
| 292 | + "\n", |
| 293 | + " curated = KnowledgeAssistantAgent().run(web_results, file_results)\n", |
| 294 | + " draft = ReportCreationAgent().run(curated, plan['outline'])\n", |
| 295 | + " return draft\n" |
| 296 | + ] |
| 297 | + }, |
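| | + { |
| | + "cell_type": "markdown", |
| | + "id": "b8c9d0e1", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The `file_results` TODO above maps to the built-in File Search tool. The cell below is a hedged sketch of an internal-knowledge agent built on `FileSearchTool`; the agent name, its instructions, and the vector store ID are placeholders, and a vector store containing your internal documents must be created first." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "b8c9d0e2", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "from agents import Agent, FileSearchTool\n", |
| | + "\n", |
| | + "# Sketch for the file_results TODO above: an agent that searches an internal\n", |
| | + "# vector store via the built-in FileSearchTool. The vector store ID is a\n", |
| | + "# placeholder; create one by uploading your documents to the OpenAI platform.\n", |
| | + "internal_search_agent = Agent(\n", |
| | + "    name=\"Internal Knowledge Agent\",\n", |
| | + "    instructions=\"Search the internal documents and summarise the passages most relevant to the query.\",\n", |
| | + "    tools=[\n", |
| | + "        FileSearchTool(\n", |
| | + "            vector_store_ids=[\"REPLACE_WITH_YOUR_VECTOR_STORE_ID\"],\n", |
| | + "            max_num_results=5,\n", |
| | + "        )\n", |
| | + "    ],\n", |
| | + "    model=\"gpt-4o\",\n", |
| | + ")\n", |
| | + "\n", |
| | + "# Illustrative call:\n", |
| | + "# internal_result = await Runner.run(internal_search_agent, \"2025 personal auto insurance premium trends\")" |
| | + ] |
| | + }, |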
| 298 | + { |
| 299 | + "cell_type": "markdown", |
| 300 | + "id": "fb69c797", |
| 301 | + "metadata": {}, |
| 302 | + "source": [ |
| 303 | + "### 6 — Guardrails & Best Practices <a id='best-practices'></a>\n", |
| 304 | + "* **Crawl → Walk → Run**: start with a single agent, then expand into a swarm. \n", |
| 305 | + "* **Expose intermediate reasoning** (“show the math”) to build user trust. \n", |
| 306 | + "* **Parameterise UX** so analysts can tweak report format and source mix. \n", |
| 307 | + "* **Native OpenAI tools first** (web browsing, file ingestion) before reinventing low‑level retrieval. " |
| 308 | + ] |
| 309 | + }, |
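| | + { |
| | + "cell_type": "markdown", |
| | + "id": "c9d0e1f2", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The Agents SDK supports input and output guardrails that can wrap the agents above. The cell below is a minimal sketch of an input guardrail that blocks requests that are not research tasks; the guardrail name, the `TopicCheck` model, and the gating instructions are illustrative assumptions, and the exact guardrail signature may vary slightly between SDK versions." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "c9d0e1f3", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "from pydantic import BaseModel\n", |
| | + "from agents import Agent, GuardrailFunctionOutput, Runner, input_guardrail\n", |
| | + "\n", |
| | + "class TopicCheck(BaseModel):\n", |
| | + "    \"\"\"Structured verdict returned by the guardrail's classifier agent.\"\"\"\n", |
| | + "    is_research_request: bool\n", |
| | + "    reasoning: str\n", |
| | + "\n", |
| | + "# Small classifier agent used inside the guardrail.\n", |
| | + "topic_check_agent = Agent(\n", |
| | + "    name=\"Topic Check Agent\",\n", |
| | + "    instructions=\"Decide whether the user input is a legitimate research request.\",\n", |
| | + "    output_type=TopicCheck,\n", |
| | + "    model=\"gpt-4o\",\n", |
| | + ")\n", |
| | + "\n", |
| | + "@input_guardrail\n", |
| | + "async def research_topic_guardrail(ctx, agent, user_input) -> GuardrailFunctionOutput:\n", |
| | + "    result = await Runner.run(topic_check_agent, user_input, context=ctx.context)\n", |
| | + "    return GuardrailFunctionOutput(\n", |
| | + "        output_info=result.final_output,\n", |
| | + "        # Tripping the wire stops the run before the main agent executes.\n", |
| | + "        tripwire_triggered=not result.final_output.is_research_request,\n", |
| | + "    )\n", |
| | + "\n", |
| | + "# Attach the guardrail when constructing an entry-point agent.\n", |
| | + "# A tripped guardrail surfaces as an InputGuardrailTripwireTriggered exception at run time.\n", |
| | + "guarded_query_expansion_agent = Agent(\n", |
| | + "    name=\"Guarded Query Expansion Agent\",\n", |
| | + "    instructions=query_expansion_agent.instructions,\n", |
| | + "    input_guardrails=[research_topic_guardrail],\n", |
| | + "    model=\"gpt-4o\",\n", |
| | + ")" |
| | + ] |
| | + }, |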
| 310 | + { |
| 311 | + "cell_type": "markdown", |
| 312 | + "id": "1bdcab82", |
| 313 | + "metadata": {}, |
| 314 | + "source": [ |
| 315 | + "### 7 — Risks & Mitigation <a id='risks'></a>\n", |
| 316 | + "| Pitfall | Mitigation |\n", |
| 317 | + "|---------|------------|\n", |
| 318 | + "| Scope‑creep & endless roadmap | Narrow MVP & SMART milestones |\n", |
| 319 | + "| Hallucinations & weak guardrails | Golden‑set evals, RAG with citation checks |\n", |
| 320 | + "| Run‑away infra costs | Cost curve modelling; efficient models + autoscaling |\n", |
| 321 | + "| Talent gaps | Upskill & leverage Agents SDK to offload core reasoning |" |
| 322 | + ] |
| 323 | + }, |
| 324 | + { |
| 325 | + "cell_type": "markdown", |
| 326 | + "id": "5b40dcf3", |
| 327 | + "metadata": {}, |
| 328 | + "source": [] |
| 329 | + } |
| 330 | + ], |
| 331 | + "metadata": { |
| 332 | + "kernelspec": { |
| 333 | + "display_name": ".venv", |
| 334 | + "language": "python", |
| 335 | + "name": "python3" |
| 336 | + }, |
| 337 | + "language_info": { |
| 338 | + "codemirror_mode": { |
| 339 | + "name": "ipython", |
| 340 | + "version": 3 |
| 341 | + }, |
| 342 | + "file_extension": ".py", |
| 343 | + "mimetype": "text/x-python", |
| 344 | + "name": "python", |
| 345 | + "nbconvert_exporter": "python", |
| 346 | + "pygments_lexer": "ipython3", |
| 347 | + "version": "3.13.1" |
| 348 | + } |
| 349 | + }, |
| 350 | + "nbformat": 4, |
| 351 | + "nbformat_minor": 5 |
| 352 | +} |