# Realtime API Agents Demo

This demo showcases two main patterns for building more advanced voice agents on top of the OpenAI Realtime API:
1. **Chat-Supervisor:** A realtime-based chat agent interacts with the user and handles basic tasks, while a more intelligent, text-based supervisor model (e.g. `gpt-4.1`) is used extensively for tool calls and more complex responses. This approach provides an easy onramp and high-quality answers, with a small increase in latency.
2. **Sequential Handoff:** Specialized agents (powered by the Realtime API) transfer the user between them to handle specific user intents. This is great for customer service, where intents are handled sequentially by specialist agents that each excel in a specific domain. It also avoids packing every instruction and tool into a single agent, which can degrade performance.

## Setup

- This is a Next.js TypeScript app. Install dependencies with `npm i`.
- Add your `OPENAI_API_KEY` to your env. Either add it to your `.bash_profile` or equivalent, or copy `.env.sample` to `.env` and add it there.
- Start the server with `npm run dev`
- Open your browser to [http://localhost:3000](http://localhost:3000). It should default to the `chatSupervisor` Agent Config.
- You can change examples via the "Scenario" dropdown in the top right.

# 1. Chat-Supervisor

This is demonstrated in the [chatSupervisor](src/app/agentConfigs/chatSupervisor/index.ts) Agent Config. The chat agent uses the realtime model to converse with the user and handle basic tasks like greeting the user, casual conversation, and collecting information, while a more intelligent, text-based supervisor model (e.g. `gpt-4.1`) is used extensively to handle tool calls and more challenging responses. You can control the decision boundary by "opting in" specific tasks to the chat agent as desired.
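
The deferral itself is mediated by a tool on the chat agent. As a rough sketch (the tool and parameter names here are illustrative, following the spirit of the demo rather than its exact code), the chat agent's one "real" tool forwards context to the supervisor:

```javascript
// Illustrative sketch: the chat agent's single "real" tool, which defers to
// the text-based supervisor model whenever a task is outside its allow list.
const getNextResponseFromSupervisor = {
  type: "function",
  name: "getNextResponseFromSupervisor",
  description:
    "Asks the supervisor model for the next response, passing along key " +
    "context from the user's most recent message.",
  parameters: {
    type: "object",
    properties: {
      relevantContextFromLastUserMessage: {
        type: "string",
        description: "Key information from the user's most recent message.",
      },
    },
    required: ["relevantContextFromLastUserMessage"],
  },
};
```

The chat agent says a short filler phrase like "Give me a moment to check on that," calls this tool, and then delivers whatever the supervisor returns.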

## Example

*In this screenshot, note the immediate response to collect the phone number, and the deferral to the supervisor agent to handle the tool call and formulate the response. Note that there was only ~2s between the end of "give me a moment to check on that." and the start of the "Thanks for waiting. Your last bill..." message.*

## Schematic
```mermaid
sequenceDiagram
    participant User
    participant ChatAgent as Chat Agent<br/>(gpt-4o-realtime-mini)
    participant Supervisor as Supervisor Agent<br/>(gpt-4.1)
    participant Tool as Tool

    alt Basic chat or info collection
        User->>ChatAgent: User message
        ChatAgent->>User: Responds directly
    else Requires higher intelligence and/or tool call
        User->>ChatAgent: User message
        ChatAgent->>User: "Let me think"
        ChatAgent->>Supervisor: Forwards message/context
        alt Tool call needed
            Supervisor->>Tool: Calls tool
            Tool->>Supervisor: Returns result
        end
        Supervisor->>ChatAgent: Returns response
        ChatAgent->>User: Delivers response
    end
```

## Benefits
- **Simpler onboarding.** If you already have a performant text-based chat agent, you can give that same prompt and set of tools to the supervisor agent, make some tweaks to the chat agent prompt, and you'll have a natural voice agent that performs on par with your text agent.
- **Simple ramp to a full realtime agent**: Rather than switching your whole agent to the Realtime API, you can move one task at a time, taking time to validate and build trust for each before deploying to production.
- **High intelligence**: You benefit from the high intelligence, excellent tool calling, and instruction following of models like `gpt-4.1` in your voice agents.
- **Lower cost**: If your chat agent is only being used for basic tasks, you can use the realtime-mini model, which, even when combined with GPT-4.1, should be cheaper than using the full 4o-realtime model.
- **User experience**: It's a more natural conversational experience than using a stitched model architecture, where response latency is often 1.5s or longer after a user has finished speaking. In this architecture, the model responds to the user right away, even if it has to lean on the supervisor agent.
  - However, more assistant responses will start with "Let me think", rather than responding immediately with the full response.

## Modifying for your own agent
1. Update [supervisorAgent](src/app/agentConfigs/chatSupervisor/supervisorAgent.ts).
  a. Add your existing text agent prompt and tools if you already have them. This should contain the "meat" of your voice agent logic, and be very specific about what it should and shouldn't do and how exactly it should respond. Add this information below `==== Domain-Specific Agent Instructions ====`.
  b. You should likely update this prompt to be more appropriate for voice, for example with instructions to be concise and to avoid long lists of items.
2. Update [chatAgent](src/app/agentConfigs/chatSupervisor/index.ts).
  a. Customize the chatAgent instructions with your own tone, greeting, etc.
  b. Add your tool definitions to `chatAgentInstructions`. We recommend a brief YAML description rather than JSON to ensure the model doesn't get confused and try calling the tool directly (see the sketch after this list).
  c. You can modify the decision boundary by adding new items to the `# Allow List of Permitted Actions` section.
3. To reduce cost, try using `gpt-4o-mini-realtime` for the chatAgent and/or `gpt-4.1-mini` for the supervisor model. To maximize intelligence on particularly difficult or high-stakes tasks, consider trading off latency by adding chain-of-thought to your supervisor prompt, or by using an additional supervisor built on a reasoning model like `o4-mini`.
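
For step 2b, the idea is to *describe* supervisor-owned tools inside the chat agent's prompt rather than registering them as callable tools on the session. A hedged sketch of what that can look like (the tool name and fields below are hypothetical):

```javascript
// Illustrative: supervisor tools are described in the chat agent's prompt
// as brief YAML, not registered on the realtime session, so the realtime
// model can't get confused and try to call them directly.
const chatAgentInstructions = `
# Supervisor Tools
# (Described for awareness only -- never call these yourself.)
- lookupBillingInfo:        # hypothetical tool name
    description: Look up a customer's billing details by phone number.
    params:
      phoneNumber: string   # must be collected and confirmed first
`;
```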

# 2. Sequential Handoffs

This pattern is inspired by [OpenAI Swarm](https://github.com/openai/swarm) and involves the sequential handoff of a user between specialized agents. Handoffs are decided by the model and coordinated via tool calls, and possible handoffs are defined explicitly in an agent graph. A handoff triggers a `session.update` event with the new agent's instructions and tools. This pattern is effective for handling a variety of user intents with specialist agents, each of which might have long instructions and numerous tools.
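
Concretely, when the model calls a transfer tool, the client can swap the live session over to the destination agent. A minimal sketch of that `session.update` event, assuming the `haikuWriter` agent from the simple example below and a WebRTC data channel (both assumptions for illustration):

```javascript
// Illustrative: after the model requests a transfer to "haikuWriter", the
// client updates the live session with that agent's instructions and tools.
const handoffEvent = {
  type: "session.update",
  session: {
    instructions: haikuWriter.instructions, // destination agent's prompt
    tools: haikuWriter.tools, // destination agent's tool schemas
  },
};
// e.g. dataChannel.send(JSON.stringify(handoffEvent));
```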

Here's a [video walkthrough](https://x.com/OpenAIDevs/status/1880306081517432936) showing how it works. You should be able to use this repo to prototype your own multi-agent realtime voice app in less than 20 minutes!

*In this simple example, the user is transferred from a greeter agent to a haiku agent. See below for the simple, full configuration of this flow.*

Configuration in `src/app/agentConfigs/simpleExample.ts`
```javascript
import { AgentConfig } from "@/app/types";

// ... (agent definitions elided in this excerpt) ...

const agents = injectTransferTools([greeter, haikuWriter]);

export default agents;
```
## CustomerServiceRetail Flow

This is a more complex, representative implementation that illustrates a customer service flow, with the following features:
- A more complex agent graph with agents for user authentication, returns, sales, and a placeholder human agent for escalations.
- An escalation by the [returns](https://github.com/openai/openai-realtime-agents/blob/60f4effc50a539b19b2f1fa4c38846086b58c295/src/app/agentConfigs/customerServiceRetail/returns.ts#L233) agent to `o4-mini` to validate and initiate a return, as an example of a high-stakes decision, using a pattern similar to the one above.
- Prompting models to follow a state machine, for example to accurately collect things like names and phone numbers with character-by-character confirmation to authenticate a user (a condensed sketch follows this list).
  - To test this flow, say that you'd like to return your snowboard and go through the necessary prompts!
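
Here's a condensed, illustrative sketch of how such a state machine can be embedded in an agent's instructions (the state IDs, wording, and fields below are hypothetical; see the authentication agent for the real thing):

```javascript
// Hypothetical example of a state machine embedded in an agent prompt.
// Each state describes what to collect and when the model may move on.
const conversationStates = [
  {
    id: "1_get_first_name",
    description: "Collect the user's first name.",
    instructions: [
      "Ask for the user's first name.",
      "Spell it back character by character to confirm.",
    ],
    transitions: [
      { next_step: "2_get_phone_number", condition: "Once the name is confirmed" },
    ],
  },
  {
    id: "2_get_phone_number",
    description: "Collect and confirm the user's phone number.",
    instructions: [
      "Ask for the user's phone number.",
      "Read it back digit by digit to confirm.",
    ],
    transitions: [
      { next_step: "3_authenticate", condition: "Once the number is confirmed" },
    ],
  },
];
```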

Configuration in [src/app/agentConfigs/customerServiceRetail/index.ts](src/app/agentConfigs/customerServiceRetail/index.ts).
```javascript
import authentication from "./authentication";
import returns from "./returns";
import sales from "./sales";
import simulatedHuman from "./simulatedHuman";
import { injectTransferTools } from "../utils";

// Define the explicit agent graph: which agents each agent can hand off to.
authentication.downstreamAgents = [returns, sales, simulatedHuman];
returns.downstreamAgents = [authentication, sales, simulatedHuman];
sales.downstreamAgents = [authentication, returns, simulatedHuman];
simulatedHuman.downstreamAgents = [authentication, returns, sales];

// Generate a transfer tool on each agent from its downstreamAgents.
const agents = injectTransferTools([
  authentication,
  returns,
  sales,
  simulatedHuman,
]);

export default agents;
```

## Schematic

This diagram illustrates a more advanced interaction flow defined in `src/app/agentConfigs/customerServiceRetail/`.

<details>
<summary><strong>Show CustomerServiceRetail Flow Diagram</strong></summary>

*(sequence diagram elided in this excerpt)*

</details>

# Other Info
## Next Steps
- You can copy these templates to make your own multi-agent voice app! Once you make a new agent set config, add it to `src/app/agentConfigs/index.ts` and you should be able to select it in the UI in the "Scenario" dropdown menu.
- Each agentConfig can define instructions, tools, and toolLogic. By default all tool calls simply return `True`, unless you define the toolLogic, which will run your specific tool logic and return an object to the conversation (e.g. for retrieved RAG context); see the sketch after this list.
- If you want help creating your own prompt using the conventions shown in customerServiceRetail, including defining a state machine, we've included a metaprompt [here](src/app/agentConfigs/voiceAgentMetaprompt.txt), or you can use our [Voice Agent Metaprompter GPT](https://chatgpt.com/g/g-678865c9fb5c81918fa28699735dd08e-voice-agent-metaprompt-gpt).
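
As a hedged sketch of a toolLogic entry that escalates a high-stakes decision to a background text model, similar in spirit to the returns flow (the agent, tool, and endpoint names below are hypothetical):

```javascript
// Hypothetical sketch: toolLogic is keyed by tool name, and whatever it
// returns is added back into the conversation as the tool call's output.
const returnsAgent = {
  name: "returns",
  // ...instructions and tool schemas elided...
  toolLogic: {
    // High-stakes decision: escalate to a smarter background text model.
    checkEligibilityAndPossiblyInitiateReturn: async (args) => {
      const response = await fetch("/api/chat/completions", { // hypothetical API route
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "o4-mini",
          messages: [
            { role: "system", content: "Decide whether this return is eligible under policy." },
            { role: "user", content: JSON.stringify(args) },
          ],
        }),
      });
      const completion = await response.json();
      // The returned object is surfaced to the realtime agent as tool output.
      return { decision: completion.choices[0].message.content };
    },
  },
};
```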
## Output Guardrails
Assistant messages are checked for safety and compliance using a guardrail function before being finalized in the transcript. This is implemented in [`src/app/hooks/useHandleServerEvent.ts`](src/app/hooks/useHandleServerEvent.ts) as the `processGuardrail` function, which is invoked on each assistant message (after every 5 incremental words received) to run a moderation/classification check. You can review or customize this logic by editing the `processGuardrail` function definition and its invocation inside `useHandleServerEvent`.
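
As a rough sketch of the shape of that check (the endpoint, categories, and helper below are assumptions for illustration, not the exact implementation):

```javascript
// Rough sketch: classify an in-progress assistant message and flag the
// transcript item if it trips a moderation category.
async function processGuardrail(itemId, text) {
  const res = await fetch("/api/guardrail", { // hypothetical moderation endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: text }),
  });
  const { moderationCategory } = await res.json(); // e.g. "NONE" | "OFFENSIVE"
  if (moderationCategory !== "NONE") {
    updateTranscriptItem(itemId, { guardrailResult: moderationCategory }); // hypothetical helper
  }
}
```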

## Navigating the UI
- You can select agent scenarios in the Scenario dropdown, and automatically switch to a specific agent with the Agent dropdown.
- The conversation transcript is on the left, including tool calls, tool call responses, and agent changes. Click to expand non-message elements.
- The event log is on the right, showing both client and server events. Click to see the full payload.
- On the bottom, you can disconnect, toggle between automated voice-activity detection and push-to-talk (PTT), turn off audio playback, and toggle logs.
## Pull Requests

Feel free to open an issue or pull request and we'll do our best to review it. The spirit of this repo is to demonstrate the core logic for new agentic flows; PRs that go beyond this core scope will likely not be merged.

# Core Contributors
- Noah MacCallum - [noahmacca](https://x.com/noahmacca)
- Ilan Bigio - [ibigio](https://github.com/ibigio)