TanStack
diff --git a/.changeset/audio-activity.md b/.changeset/audio-activity.md
Lines changed: 12 additions & 0 deletions
diff --git a/.changeset/audio-example-pages.md b/.changeset/audio-example-pages.md
Lines changed: 15 additions & 0 deletions
diff --git a/.changeset/audio-generation-hook.md b/.changeset/audio-generation-hook.md
Lines changed: 24 additions & 0 deletions
diff --git a/.changeset/fal-audio-speech-transcription.md b/.changeset/fal-audio-speech-transcription.md
Lines changed: 11 additions & 0 deletions
diff --git a/.changeset/gemini-audio.md b/.changeset/gemini-audio.md
Lines changed: 14 additions & 0 deletions
diff --git a/.changeset/generated-media-union.md b/.changeset/generated-media-union.md
Lines changed: 20 additions & 0 deletions
diff --git a/docs/adapters/fal.md b/docs/adapters/fal.md
Lines changed: 188 additions & 3 deletions
diff --git a/docs/advanced/debug-logging.md b/docs/advanced/debug-logging.md
Lines changed: 5 additions & 1 deletion
diff --git a/docs/config.json b/docs/config.json
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,12 @@
+---
+'@tanstack/ai': minor
+---
+
+feat: add generateAudio activity for music and sound-effect generation
+
+Adds a new `audio` activity kind alongside the existing `tts` and `transcription` activities:
+
+- `generateAudio()` / `createAudioOptions()` functions
+- `AudioAdapter` interface and `BaseAudioAdapter` base class
+- `AudioGenerationOptions` / `AudioGenerationResult` / `GeneratedAudio` types
+- `audio:request:started`, `audio:request:completed`, and `audio:usage` devtools events
@@ -0,0 +1,15 @@
+---
+---
+
+chore: add ts-react-chat example pages and E2E coverage for audio providers
+
+**Example pages** (`examples/ts-react-chat`):
+
+- Updated text-to-speech and transcription pages with provider tabs (OpenAI, Gemini, Fal for TTS; OpenAI, Fal for transcription)
+- Added a new `/generations/audio` page covering Gemini Lyria and Fal audio generation
+- Added a shared `audio-providers` catalog and server-side adapter factories for audio, speech, and transcription
+
+**Tests**:
+
+- Added `@tanstack/ai-gemini` unit tests covering the Gemini TTS adapter (single-speaker default, multi-speaker config, missing-audio errors)
+- Added a new `audio-gen` feature to the E2E harness with a Gemini Lyria adapter factory, route, UI, fixture, and spec
@@ -0,0 +1,24 @@
+---
+'@tanstack/ai': minor
+'@tanstack/ai-client': minor
+'@tanstack/ai-react': minor
+'@tanstack/ai-solid': minor
+'@tanstack/ai-vue': minor
+'@tanstack/ai-svelte': minor
+---
+
+feat: add `useGenerateAudio` hook and streaming support for `generateAudio()`
+
+Closes the parity gap between audio generation and the other media
+activities (image, speech, video, transcription, summarize):
+
+- `generateAudio()` now accepts `stream: true`, returning an
+  `AsyncIterable<StreamChunk>` that can be piped through
+  `toServerSentEventsResponse()`.
+- `AudioGenerateInput` type added to `@tanstack/ai-client`.
+- `useGenerateAudio` hook added to `@tanstack/ai-react`,
+  `@tanstack/ai-solid`, and `@tanstack/ai-vue`; matching
+  `createGenerateAudio` added to `@tanstack/ai-svelte`. All follow the same
+  `{ generate, result, isLoading, error, status, stop, reset }` shape as
+  the existing media hooks and support both `connection` (SSE) and
+  `fetcher` transports.
@@ -0,0 +1,11 @@
+---
+'@tanstack/ai-fal': minor
+---
+
+feat: add audio, speech, and transcription adapters to @tanstack/ai-fal
+
+Adds three new tree-shakeable adapters alongside the existing `falImage()` and `falVideo()`:
+
+- `falSpeech()` — text-to-speech via models like Google `fal-ai/gemini-3.1-flash-tts`, `fal-ai/elevenlabs/tts/eleven-v3`, `fal-ai/minimax/speech-2.6-hd`, `fal-ai/kokoro/*`
+- `falTranscription()` — speech-to-text via `fal-ai/whisper`, `fal-ai/wizper`, `fal-ai/speech-to-text/turbo`, `fal-ai/elevenlabs/speech-to-text`
+- `falAudio()` — music and sound-effect generation via `fal-ai/minimax-music/v2.6`, `fal-ai/diffrhythm`, `fal-ai/lyria2`, `fal-ai/stable-audio-25/text-to-audio`, `fal-ai/elevenlabs/sound-effects/v2`
@@ -0,0 +1,14 @@
+---
+'@tanstack/ai-gemini': minor
+---
+
+feat(ai-gemini): add Lyria 3 Pro / Clip audio adapter and Gemini 3.1 Flash TTS
+
+**New adapter:**
+
+- `geminiAudio()` for Google Lyria music generation — supports `lyria-3-pro-preview` (full-length songs, MP3/WAV 48 kHz stereo) and `lyria-3-clip-preview` (30-second MP3 clips)
+
+**Enhanced:**
+
+- Added `gemini-3.1-flash-tts-preview` to the TTS model list (70+ languages, 200+ audio tags for expressive control)
+- Added `multiSpeakerVoiceConfig` to `GeminiTTSProviderOptions` for 2-speaker dialogue generation
@@ -0,0 +1,20 @@
+---
+'@tanstack/ai': minor
+'@tanstack/ai-openrouter': patch
+'@tanstack/ai-fal': patch
+'@tanstack/ai-openai': patch
+'@tanstack/ai-gemini': patch
+'@tanstack/ai-grok': patch
+---
+
+Tighten `GeneratedImage` and `GeneratedAudio` to enforce exactly one of `url` or `b64Json` via a mutually-exclusive `GeneratedMediaSource` union.
+
+Both types previously declared `url?` and `b64Json?` as independently optional, which allowed meaningless `{}` values and objects that set both fields. They now require exactly one:
+
+```ts
+type GeneratedMediaSource =
+  | { url: string; b64Json?: never }
+  | { b64Json: string; url?: never }
+```
+
+Existing read patterns like `` img.url || `data:image/png;base64,${img.b64Json}` `` continue to work unchanged. The only runtime-visible change is that the `@tanstack/ai-openrouter` and `@tanstack/ai-fal` image adapters no longer populate `url` with a synthesized `data:image/png;base64,...` URI when the provider returns base64; they return `{ b64Json }` only. Consumers that want a data URI should build it from `b64Json` at render time.
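As a minimal sketch of the render-time pattern this changeset describes, the helper below resolves a displayable source from either branch of the union. The type is copied from the changeset; `toMediaSrc` and its `mimeType` parameter are hypothetical names introduced for illustration and are not part of `@tanstack/ai`.

```typescript
// Copied from the changeset: exactly one of `url` or `b64Json` is set.
type GeneratedMediaSource =
  | { url: string; b64Json?: never }
  | { b64Json: string; url?: never }

// Hypothetical helper (an assumption, not a library export).
// The caller supplies the MIME type because the union does not carry one.
function toMediaSrc(media: GeneratedMediaSource, mimeType: string): string {
  // On the base64 branch `url` is absent, so this check selects the URL branch.
  if (typeof media.url === 'string') {
    return media.url
  }
  // Otherwise build the data URI at render time from the base64 payload.
  return `data:${mimeType};base64,${media.b64Json}`
}
```

Under this typing, an empty object or an object that sets both fields should be rejected at compile time, so a helper like this only needs to handle the two valid shapes.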
@@ -13,7 +13,7 @@ keywords:
   - adapter
 ---
 
-The fal.ai adapter provides access to 600+ models on the fal.ai platform for image generation and video generation. Unlike text-focused adapters, the fal adapter is **media-focused** — it supports `generateImage()` and `generateVideo()` but does not support `chat()` or tools. Audio and speech support are coming soon.
+The fal.ai adapter provides access to 600+ models on the fal.ai platform for image, video, audio, speech, and transcription. Unlike text-focused adapters, the fal adapter is **media-focused** — it supports `generateImage()`, `generateVideo()`, `generateAudio()`, `generateSpeech()`, and `generateTranscription()` but does not support `chat()` or tools.
 
 For a full working example, see the [fal.ai example app](https://github.com/TanStack/ai/tree/main/examples/ts-react-media).
 
@@ -209,6 +209,113 @@ const job = await generateVideo({
 });
 ```
 
+## Text-to-Speech
+
+Text-to-speech uses `falSpeech()` with the `generateSpeech()` activity. The adapter fetches the generated audio from fal's CDN and returns it as base64-encoded data to match the `TTSResult` contract.
+
+```typescript
+import { generateSpeech } from "@tanstack/ai";
+import { falSpeech } from "@tanstack/ai-fal";
+
+const result = await generateSpeech({
+  adapter: falSpeech("fal-ai/kokoro/american-english"),
+  text: "Hello from fal!",
+  voice: "af_heart",
+  speed: 1.0,
+});
+
+// result.audio is a base64-encoded string
+console.log(result.format); // e.g. "wav"
+console.log(result.contentType); // e.g. "audio/wav"
+```
+
+### Google Gemini 3.1 Flash TTS
+
+Google's newest TTS model (`fal-ai/gemini-3.1-flash-tts`) supports 80+ languages and introduces **granular audio tags** for expressive control — you can embed speaker tags and style cues directly in the text.
+
+```typescript
+const result = await generateSpeech({
+  adapter: falSpeech("fal-ai/gemini-3.1-flash-tts"),
+  text: "[warm, enthusiastic] Welcome to TanStack AI! [pause] Let's build something great.",
+  voice: "Kore",
+});
+```
+
+> **Note:** This model is newer than `@fal-ai/client@1.9.1`'s type map, so `modelOptions` won't autocomplete. The call still works — the fal adapter accepts any model ID as a string. Type-safe autocomplete will land when fal's SDK types catch up.
+
+### ElevenLabs v3
+
+```typescript
+const result = await generateSpeech({
+  adapter: falSpeech("fal-ai/elevenlabs/tts/eleven-v3"),
+  text: "Welcome to TanStack AI.",
+  modelOptions: {
+    voice: "Rachel",
+    stability: 0.5,
+  },
+});
+```
+
+## Transcription
+
+Speech-to-text uses `falTranscription()` with the `generateTranscription()` activity. The `audio` input accepts a URL string, `Blob`, `File`, or `ArrayBuffer` — `ArrayBuffer` is automatically wrapped in a `Blob` for upload.
+
+```typescript
+import { generateTranscription } from "@tanstack/ai";
+import { falTranscription } from "@tanstack/ai-fal";
+
+const result = await generateTranscription({
+  adapter: falTranscription("fal-ai/whisper"),
+  audio: "https://example.com/recording.mp3",
+  language: "en",
+});
+
+console.log(result.text);
+console.log(result.language);
+
+// When the model returns word/segment timestamps, they're mapped to result.segments
+for (const segment of result.segments ?? []) {
+  console.log(`[${segment.start}s → ${segment.end}s] ${segment.text}`);
+}
+```
+
+## Audio Generation (Music & Sound Effects)
+
+Music and sound-effect generation uses `falAudio()` with the `generateAudio()` activity. Unlike TTS, the result is returned as a URL in `result.audio.url` (you can fetch it yourself if you need raw bytes).
+
+```typescript
+import { generateAudio } from "@tanstack/ai";
+import { falAudio } from "@tanstack/ai-fal";
+
+// Music generation with MiniMax Music 2.6 (latest)
+const music = await generateAudio({
+  adapter: falAudio("fal-ai/minimax-music/v2.6"),
+  prompt: "City Pop, 80s retro, groovy synth bass, warm female vocal, 104 BPM, nostalgic urban night",
+});
+
+console.log(music.audio.url);
+```
+
+```typescript
+// DiffRhythm with explicit lyrics
+const lyrical = await generateAudio({
+  adapter: falAudio("fal-ai/diffrhythm"),
+  prompt: "An upbeat electronic track with synths",
+  modelOptions: {
+    lyrics: "[verse]\nHello world\n[chorus]\nLa la la",
+  },
+});
+```
+
+```typescript
+// Sound effects
+const sfx = await generateAudio({
+  adapter: falAudio("fal-ai/elevenlabs/sound-effects/v2"),
+  prompt: "Thunderclap with rain",
+  duration: 5,
+});
+```
+
 ## Popular Models
 
 ### Image Models
@@ -233,6 +340,49 @@ const job = await generateVideo({
 | `fal-ai/ltx-2/text-to-video/fast` | Text-to-Video | Fast text-to-video |
 | `fal-ai/ltx-2/image-to-video/fast` | Image-to-Video | Fast image-to-video animation |
 
+### Text-to-Speech Models
+
+| Model | Description |
+|-------|-------------|
+| `fal-ai/gemini-3.1-flash-tts` | **New** — Google's flagship TTS with 80+ languages and expressive audio tags |
+| `fal-ai/elevenlabs/tts/eleven-v3` | ElevenLabs v3 expressive multi-voice TTS |
+| `fal-ai/elevenlabs/tts/turbo-v2.5` | Low-latency ElevenLabs TTS |
+| `fal-ai/minimax/speech-2.6-hd` | MiniMax HD speech synthesis |
+| `fal-ai/minimax/speech-2.6-turbo` | MiniMax low-latency variant |
+| `fal-ai/kokoro/american-english` | Kokoro multilingual TTS — also `british-english`, `french`, `spanish`, `italian`, `japanese`, `mandarin-chinese`, `hindi`, `brazilian-portuguese` |
+| `fal-ai/inworld-tts` | Inworld TTS-1.5 Max |
+| `fal-ai/chatterbox/text-to-speech/multilingual` | Chatterbox multilingual TTS |
+| `fal-ai/dia-tts` | Dia expressive dialogue TTS |
+| `fal-ai/orpheus-tts` | Orpheus open-source TTS |
+| `fal-ai/f5-tts` | F5-TTS voice cloning |
+| `fal-ai/vibevoice/7b` | VibeVoice 7B conversational TTS |
+
+### Transcription Models
+
+| Model | Description |
+|-------|-------------|
+| `fal-ai/whisper` | OpenAI Whisper on fal infra |
+| `fal-ai/wizper` | Faster-whisper variant with word-level timestamps |
+| `fal-ai/speech-to-text/turbo` | Turbo STT with diarization |
+| `fal-ai/elevenlabs/speech-to-text` | ElevenLabs STT |
+
+### Audio / Music Models
+
+| Model | Mode | Description |
+|-------|------|-------------|
+| `fal-ai/minimax-music/v2.6` | Music | **New** — MiniMax Music 2.6, full vocal + instrumental compositions from a prompt |
+| `fal-ai/minimax-music/v2.5` | Music | MiniMax Music 2.5 |
+| `fal-ai/minimax-music/v2` | Music | MiniMax Music v2 — supports `lyrics_prompt` |
+| `fal-ai/diffrhythm` | Music | DiffRhythm — prompt + lyrics |
+| `fal-ai/lyria2` | Music | Google Lyria 2 high-fidelity music |
+| `fal-ai/stable-audio-25/text-to-audio` | Music / Audio | Stability AI Stable Audio 2.5 |
+| `fal-ai/mmaudio-v2/text-to-audio` | Audio | MMAudio v2 text-to-audio |
+| `fal-ai/elevenlabs/sound-effects/v2` | SFX | ElevenLabs sound-effect generation |
+| `fal-ai/beatoven/sound-effect-generation` | SFX | Beatoven professional sound effects |
+| `fal-ai/thinksound` | Audio | Thinksound reasoning-based audio generation |
+
+> **Very new models** (e.g. `gemini-3.1-flash-tts`, `minimax-music/v2.6`, `beatoven/sound-effect-generation`) may not yet appear in `@fal-ai/client`'s type map. They still work as plain string model IDs, but you won't get autocomplete for their `modelOptions`.
+
 ## Environment Variables
 
 Create an API key at [fal.ai](https://fal.ai) and set it in your environment:
@@ -275,6 +425,42 @@ Creates a fal.ai video adapter using the `FAL_KEY` environment variable or an ex
 
 Alias for `falVideo()`.
 
+### `falSpeech(model, config?)`
+
+Creates a fal.ai text-to-speech adapter.
+
+**Parameters:**
+
+- `model` - The fal.ai TTS model ID (e.g., `"fal-ai/kokoro/american-english"`)
+- `config.apiKey?` - Your fal.ai API key (falls back to `FAL_KEY` env var)
+- `config.proxyUrl?` - Proxy URL for client-side usage
+
+**Returns:** A `FalSpeechAdapter` instance for use with `generateSpeech()`. The adapter downloads the generated audio from the URL fal returns and exposes it as base64 in `result.audio`.
+
+### `falTranscription(model, config?)`
+
+Creates a fal.ai transcription (speech-to-text) adapter.
+
+**Parameters:**
+
+- `model` - The fal.ai STT model ID (e.g., `"fal-ai/whisper"`)
+- `config.apiKey?` - Your fal.ai API key (falls back to `FAL_KEY` env var)
+- `config.proxyUrl?` - Proxy URL for client-side usage
+
+**Returns:** A `FalTranscriptionAdapter` instance for use with `generateTranscription()`.
+
+### `falAudio(model, config?)`
+
+Creates a fal.ai audio generation adapter (music and sound effects).
+
+**Parameters:**
+
+- `model` - The fal.ai audio model ID (e.g., `"fal-ai/diffrhythm"`, `"fal-ai/minimax-music/v2"`)
+- `config.apiKey?` - Your fal.ai API key (falls back to `FAL_KEY` env var)
+- `config.proxyUrl?` - Proxy URL for client-side usage
+
+**Returns:** A `FalAudioAdapter` instance for use with `generateAudio()`. The result contains a URL at `result.audio.url`.
+
 ### `getFalApiKeyFromEnv()`
 
 Reads the `FAL_KEY` environment variable. Throws if not set.
@@ -288,9 +474,8 @@ Configures the underlying `@fal-ai/client`. Called automatically by adapter cons
 ## Limitations
 
 - **No text/chat support** — Use OpenAI, Anthropic, Gemini, or another text adapter for `chat()`
-- **No tools support** — Tool definitions are not applicable to image/video generation
+- **No tools support** — Tool definitions are not applicable to media generation
 - **No summarization** — Use a text adapter for `summarize()`
-- **No TTS or transcription yet** — Audio and speech support are coming soon
 - **Video is experimental** — The video generation API may change in future releases
 
 ## Next Steps
 
@@ -128,7 +128,7 @@ const logger: Logger = {
 | Category | Logs | Applies to |
 |----------|------|------------|
 | `request` | Outgoing call to a provider (model, message count, tool count) | All activities |
-| `provider` | Every raw chunk/frame received from a provider SDK | Streaming activities (chat, realtime) |
+| `provider` | Every raw chunk/frame received from a provider SDK | Streaming activities (`chat`, `realtime`, and streaming `generateAudio`/`generateSpeech`/`generateTranscription`) |
 | `output` | Every chunk or result yielded to the caller | All activities |
 | `middleware` | Inputs and outputs around every middleware hook | `chat()` only |
 | `tools` | Before/after tool call execution | `chat()` only |
@@ -154,8 +154,12 @@ The same `debug` option works on every activity:
 summarize({ adapter, text, debug: true });
 generateImage({ adapter, prompt: "a cat", debug: { logger } });
 generateSpeech({ adapter, text, debug: { request: true } });
+generateAudio({ adapter, prompt: "ambient piano", debug: true });
+generateTranscription({ adapter, audio, debug: { provider: true } });
 ```
 
+When streaming any of these (`generateAudio`, `generateSpeech`, `generateTranscription` with `stream: true`), the `provider` category emits the raw SDK chunks and `output` emits the AG-UI-shaped chunks yielded to the caller — useful when a media pipeline looks stuck or the bytes arriving don't match what you expected.
+
 The chat-only categories (`middleware`, `tools`, `agentLoop`, `config`) simply never fire for these activities because those concepts don't exist in their pipelines.
 
 ## Related
 
@@ -146,6 +146,10 @@
           "label": "Transcription",
           "to": "media/transcription"
         },
+        {
+          "label": "Audio Generation",
+          "to": "media/audio-generation"
+        },
         {
           "label": "Image Generation",
           "to": "media/image-generation"