Commit 54523f5

tombeckenham, claude, autofix-ci[bot], AlemTuzlak, and github-actions[bot] authored
feat: audio media support — fal audio/speech/STT adapters, Gemini Lyria + 3.1 Flash TTS, streaming generateAudio + hooks (#463)
* feat: add fal audio, speech, and transcription adapters

  Adds falSpeech, falTranscription, and falAudio adapters to @tanstack/ai-fal, completing fal's media coverage alongside image and video. Introduces a new generateAudio activity in @tanstack/ai for music and sound-effect generation, with matching devtools events and types.

  Closes #328

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add ElevenLabs TTS/music/SFX/transcription adapters and Gemini Lyria + 3.1 Flash TTS

  Extends @tanstack/ai-elevenlabs (which already covers realtime voice) with Speech, Music, Sound Effects, and Transcription adapters, each tree-shakeable under its own import. Adds Gemini Lyria 3 Pro / Clip music generation via a new generateAudio adapter, plus the new Gemini 3.1 Flash TTS Preview model with multi-speaker dialogue support.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: document fal audio, speech, and transcription adapters

  Adds a new Audio Generation page, expands the fal adapter reference with sections for text-to-speech, transcription, and audio/music, and adds fal sections to the Text-to-Speech and Transcription guides.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add example pages and tests for audio/tts providers

  Expand the ts-react-chat example with provider tabs for OpenAI, ElevenLabs, Gemini, and Fal on the TTS and transcription pages, plus a new /generations/audio page covering ElevenLabs Music, ElevenLabs SFX, Gemini Lyria, and Fal audio generation. Add a Gemini TTS unit test and wire an audio-gen feature into the E2E harness (adapter factory, API route, UI, fixture, and Playwright spec).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: apply automated fixes

* docs: lead audio generation guide with Gemini and ElevenLabs

  Reorder the Audio Generation page so the direct Gemini (Lyria) and ElevenLabs (music/sfx) adapters appear before fal.ai, and update the environment variables + result-shape notes to cover all three providers.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ts-react-chat): add audio home tile, sample prompts, and fal model selector

  Expose an Audio tile on the welcome grid, offer one-click sample prompts for every audio provider, and let the Fal provider pick between current text-to-music models (default MiniMax v2.6). Threads a model override through the audio API and server fn.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: apply automated fixes

* chore: split ElevenLabs audio adapters out to separate PR (#485)

  Moves the new ElevenLabs TTS / Music / SFX / Transcription REST adapters out of this PR into their own issue (#485) and branch (`elevenlabs-audio-adapters`) so the fal + Gemini audio work can ship independently. The follow-up PR will rebuild these adapters on top of the official `@elevenlabs/elevenlabs-js` SDK rather than hand-rolled fetch calls.

  Removed from this branch:

  - `packages/typescript/ai-elevenlabs/src/{adapters,utils,model-meta.ts}` and their tests (realtime voice code untouched)
  - ElevenLabs sections in `docs/media/audio-generation.md`
  - ElevenLabs entries in `examples/ts-react-chat` audio-providers catalog, server adapter factories, zod schemas, and default provider wiring
  - `@tanstack/ai-elevenlabs` bump from the audio changeset

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: apply automated fixes

* fix(ai-fal, ai-gemini): audio adapter bug fixes

  - ai-fal: replace `btoa(String.fromCharCode(...bytes))` with a chunked helper; the spread form throws RangeError on any realistic TTS clip (V8 arg limit ~65k).
  - ai-gemini: honor `TTSOptions.voice` as a fallback for the prebuilt voice name, move `systemInstruction` inside `config` per the @google/genai contract, and wrap raw `audio/L16;codec=pcm` output in a RIFF/WAV container so the result is actually playable.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ts-react-chat): warn on rejected audio model overrides

  Log a warning instead of silently swapping to the default when a client sends a model id outside the provider's allowlist, so stale clients or typo'd config ids are debuggable. Also correct the AudioProviderConfig JSDoc to describe the models[] ordering as a non-binding UI convention.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: split generateAudio into generateMusic and generateSoundEffects

  Replaces the unreleased generateAudio activity with two distinct activities so music and sound-effects each have their own types, adapter kinds, provider factories, and devtools events. This lets providers advertise only the capabilities they support (Gemini Lyria is music-only; fal has distinct music and SFX catalogs) and leaves room for kind-specific options without a breaking change.

  - Core: generateMusic/generateSoundEffects activities and MusicAdapter/SoundEffectsAdapter interfaces + bases; GeneratedAudio shared between MusicGenerationResult and SoundEffectsGenerationResult
  - Events: music:request:* and soundEffects:request:* replace audio:*
  - fal: falMusic + falSoundEffects factories sharing internal request/response helpers; FalMusic/FalSoundEffectsProviderOptions in model-meta
  - Gemini: geminiMusic/createGeminiMusic/GeminiMusicAdapter (Lyria is music-only so no SFX counterpart)
  - ts-react-chat: /generations/music and /generations/sound-effects routes backed by a shared AudioGenerationForm; split server fns and API routes
  - E2E: music-gen + sound-effects-gen features, parameterized MediaAudioGenUI, split fixtures and specs (both feature support sets are empty since aimock 1.14 cannot mock Gemini's Lyria AUDIO modality)
  - Docs: music-generation.md + sound-effects-generation.md; fal adapter docs split; changesets rewritten in place

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fixed type issue

* Delete terminal output

* revert: restore single generateAudio activity

  Supersedes 1010e9b. The split into generateMusic + generateSoundEffects doesn't hold up against fal's audio catalog: dozens of models span audio-to-audio, voice-change/clone, enhancement, separation, isolation, merge, and understanding, and individual models (e.g. stable-audio-25) generate music AND sound effects. A single broader generateAudio activity fits that reality.

  Keeps the aimock Gemini-Lyria gap: audio-gen feature-support stays empty because aimock 1.14 has no AUDIO-modality mock for generateContent — the E2E is green by skipping rather than by hitting a mock that doesn't exist.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: enforce exactly one of url or b64Json on GeneratedImage and GeneratedAudio

  Model GeneratedImage and GeneratedAudio on a shared mutually-exclusive GeneratedMediaSource union so the type rejects empty objects and objects that set both fields. Update the openai, gemini, grok, openrouter, and fal image adapters to construct results by branching on which field is present; openrouter and fal no longer synthesize a data URI on url when returning base64.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: apply automated fixes

* chore(e2e): drop audio-gen scaffolding pending aimock support

  The audio-gen feature set was empty because aimock cannot currently mock audio generation, so the Playwright spec ran against zero providers. Remove the dead scaffolding; the wiring can return once aimock audio support lands.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add useGenerateAudio hook and streaming support for generateAudio

  Closes the parity gap with the other media activities — audio generation now has the same client-hook UX (connection + fetcher transports) as image, speech, video, transcription, and summarize. Adds streaming to generateAudio so it can ride the SSE transport, a matching AudioGenerateInput type in ai-client, framework hooks in ai-react / ai-solid / ai-vue / ai-svelte, unit tests, an updated ts-react-chat example, and docs.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ai-fal): translate duration per audio model

  Fal audio models use different input field names for length: ElevenLabs Music takes `music_length_ms` in milliseconds, Stable Audio 2.5 takes `seconds_total`, and most others accept `duration`. The adapter was passing a generic `duration` unconditionally, so the slider in the example was silently ignored for ElevenLabs and Stable Audio.

  Also: align the Gemini Lyria adapter with the API's MP3 default (only send responseMimeType when the caller asks for WAV), expand the example to include Lyria 3 Pro and a dedicated Fal SFX provider, and rename the example's "Direct" mode to "Hooks" to better reflect what it demos.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ai-gemini): rename GEMINI_LYRIA_MODELS to GEMINI_AUDIO_MODELS

  Align the audio model constant and its re-export with the `generateAudio` activity naming used across providers, and drop the unused duplicate `GeminiLyriaModel` type — `GeminiAudioModel` is the single canonical type.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ai-gemini): address CR findings — constructor config, TTS model name, PCM channels, voice validation, image error surfacing

* fix(ai-fal): address CR findings — generateId entropy, fetch.ok guards, response-shape validation, size params, proxy+apiKey, content types

* fix(ai-fal): throw on unknown image response shape instead of returning empty

* fix(ai-image-adapters): fix double-wrapped errors, duplicate keys, signature mismatch, null guards

* fix(ai-gemini): address CR findings — test import, image model output meta, option filtering

* fix(example-ts-react-chat): blob URL revocation, route link, body validation, falsy duration render

* fix(ai-core): emit adapter-error events, consistent async, reordered base adapter ctor, type sync

* fix(ai-openrouter): drop redundant null guards that TS types already enforce

  The defensive nullish-coalescing on response.choices and img/img.imageUrl guards that the fix-loop added are impossible per the SDK type signatures; eslint's no-unnecessary-condition correctly rejects them. Keep only the typeof url !== 'string' check, which is a real runtime shape guard (imageUrl.url is typed as string but provider may send a non-string in rare degraded responses).

* fix: address CodeRabbit review feedback — SSE types, mime normalization, voice validation, etc.

  Applies the reviewer-flagged changes that weren't load-bearing for the merge:

  - event-client: AudioRequestCompletedEvent.audio is now a mutually-exclusive {url; never b64Json} | {b64Json; never url} union so consumers can't read both fields simultaneously, mirroring the GeneratedAudio contract in core.
  - fal utils: extractUrlExtension now strips URL fragments and trailing slashes, parses via the URL API so a TLD like `.com` isn't mistaken for an extension, and only inspects the final path segment.
  - fal utils: deriveAudioContentType returns `audio/aac` for aac, separated from the `m4a`/`mp4` → `audio/mp4` case.
  - fal speech: prefer URL-derived extension when deriving `format`, and normalize `mpeg` → `mp3` so the field is a usable file extension.
  - gemini audio: drop `negativePrompt` (not accepted by GenerateContentConfig) and `responseMimeType` (Lyria Clip rejects it, Pro returns MP3 by default) from the public provider options surface, and document that the generic `duration` option is ignored by Lyria (Clip is fixed at 30s, Pro takes duration via the natural-language prompt).
  - gemini tts: multiSpeakerVoiceConfig.speakerVoiceConfigs length is now validated (1 or 2 speakers), partial user-supplied voiceConfig correctly falls back to the standard voice/'Kore' default, parsePcmMimeType tightens detection to exclude subtypes containing "wav" so containerized `audio/wav;codec=pcm` is no longer re-wrapped, and createGeminiSpeech / createGeminiAudio factory functions now spread config before the explicit apiKey argument so caller config can't silently override the API key.
  - ts-react-chat API routes: replace zod 4's removed `.flatten()` with `z.treeifyError()` for validation error details.
  - ts-react-chat audio route: `toAudioOutput` returns `null` per the `onResult` hook contract instead of throwing synchronously — failures are still surfaced via the hook's error state.
  - Updates the tests affected by the above behavior changes.

* docs: document debug logging for new audio/speech/transcription activities

  - debug-logging.md: list generateAudio/generateTranscription in Non-chat activities section; clarify that the `provider` category now applies to streaming generateAudio/generateSpeech/generateTranscription calls too.
  - audio-generation.md, text-to-speech.md, transcription.md: add a single contextual callout at the moment a builder is most likely to need it (immediately before the Options table / next to Error Handling), pointing to the debug-logging guide.

* docs(skill): add audio/speech CR gotchas + debug-logging to media-generation skill

  Agents hitting the new generateAudio/generateSpeech/generateTranscription activities will run into:

  - Gemini Lyria doesn't accept responseMimeType or negativePrompt via GenerateContentConfig — shape the prompt instead.
  - Lyria 3 Clip is fixed 30s; Lyria 3 Pro reads duration from natural-language in the prompt, not the duration option. fal audio maps duration per-model.
  - Gemini TTS multiSpeakerVoiceConfig is validated to 1 or 2 speakers.
  - debug: DebugOption is threaded through every generate*() activity — reach for it instead of writing logging middleware.

  Adds four Common Mistake entries, sources the debug-logging doc, and cross-references the ai-core/debug-logging sub-skill.

* fix(ai-fal): decode data URL audio inputs to Blob for transcription

  fal-client auto-uploads Blob/File inputs via fal.storage.upload but passes strings through unchanged, so data URLs reached fal's API and got rejected with 422 "Unsupported data URL". Decode data URL strings to a Blob in buildInput so the auto-upload path handles them; plain http(s) URLs still pass through.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regenerate API documentation (#494)

  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Alem Tuzlak <t.zlak@hotmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
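
The commit message describes replacing `btoa(String.fromCharCode(...bytes))` with a chunked helper because spreading a whole audio clip into one call can exceed the engine's argument limit. A minimal sketch of that technique follows. The names `bytesToBase64` and `CHUNK_SIZE` and the 32 KiB chunk size are illustrative assumptions, not the adapter's actual code.

```typescript
// Hypothetical helper illustrating chunked base64 encoding; the real adapter may differ.
const CHUNK_SIZE = 0x8000; // 32,768 bytes per String.fromCharCode call (assumed value)

function bytesToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let offset = 0; offset < bytes.length; offset += CHUNK_SIZE) {
    // subarray creates a view, so no bytes are copied here.
    const chunk = bytes.subarray(offset, offset + CHUNK_SIZE);
    // Spreading a bounded chunk keeps the argument count far below engine limits.
    binary += String.fromCharCode(...chunk);
  }
  // btoa expects a "binary string" where each char code is one byte.
  return btoa(binary);
}
```

Building the binary string in bounded slices trades a little string concatenation for predictable behavior on large clips.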
1 parent e33ee09 commit 54523f5

86 files changed

Lines changed: 6376 additions & 436 deletions

.changeset/audio-activity.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+---
+'@tanstack/ai': minor
+---
+
+feat: add generateAudio activity for music and sound-effect generation
+
+Adds a new `audio` activity kind alongside the existing `tts` and `transcription` activities:
+
+- `generateAudio()` / `createAudioOptions()` functions
+- `AudioAdapter` interface and `BaseAudioAdapter` base class
+- `AudioGenerationOptions` / `AudioGenerationResult` / `GeneratedAudio` types
+- `audio:request:started`, `audio:request:completed`, and `audio:usage` devtools events

.changeset/audio-example-pages.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+---
+---
+
+chore: add ts-react-chat example pages and E2E coverage for audio providers
+
+**Example pages** (`examples/ts-react-chat`):
+
+- Updated text-to-speech and transcription pages with provider tabs (OpenAI, Gemini, Fal for TTS; OpenAI, Fal for transcription)
+- Added a new `/generations/audio` page covering Gemini Lyria and Fal audio generation
+- Added a shared `audio-providers` catalog and server-side adapter factories for audio, speech, and transcription
+
+**Tests**:
+
+- Added `@tanstack/ai-gemini` unit tests covering the Gemini TTS adapter (single-speaker default, multi-speaker config, missing-audio errors)
+- Added a new `audio-gen` feature to the E2E harness with a Gemini Lyria adapter factory, route, UI, fixture, and spec
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+---
+'@tanstack/ai': minor
+'@tanstack/ai-client': minor
+'@tanstack/ai-react': minor
+'@tanstack/ai-solid': minor
+'@tanstack/ai-vue': minor
+'@tanstack/ai-svelte': minor
+---
+
+feat: add `useGenerateAudio` hook and streaming support for `generateAudio()`
+
+Closes the parity gap between audio generation and the other media
+activities (image, speech, video, transcription, summarize):
+
+- `generateAudio()` now accepts `stream: true`, returning an
+  `AsyncIterable<StreamChunk>` that can be piped through
+  `toServerSentEventsResponse()`.
+- `AudioGenerateInput` type added to `@tanstack/ai-client`.
+- `useGenerateAudio` hook added to `@tanstack/ai-react`,
+  `@tanstack/ai-solid`, and `@tanstack/ai-vue`; matching
+  `createGenerateAudio` added to `@tanstack/ai-svelte`. All follow the same
+  `{ generate, result, isLoading, error, status, stop, reset }` shape as
+  the existing media hooks and support both `connection` (SSE) and
+  `fetcher` transports.
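
The changeset above says the streamed result is an async iterable of chunks that can be sent over Server-Sent Events. To make that idea concrete, here is a generic sketch of SSE framing over an async iterable. It does not use or reproduce `toServerSentEventsResponse()`. `StreamChunkLike`, `toSseFrames`, `fakeAudioStream`, and the chunk `type` values are all hypothetical stand-ins.

```typescript
// Hypothetical stand-in for the library's StreamChunk type.
type StreamChunkLike = { type: string; [key: string]: unknown };

// Encode each chunk as one SSE "data:" frame; a blank line terminates the frame.
async function* toSseFrames(
  chunks: AsyncIterable<StreamChunkLike>,
): AsyncGenerator<string> {
  for await (const chunk of chunks) {
    yield `data: ${JSON.stringify(chunk)}\n\n`;
  }
}

// Demo source standing in for a streaming generateAudio() call (assumed chunk shapes).
async function* fakeAudioStream(): AsyncGenerator<StreamChunkLike> {
  yield { type: "started" };
  yield { type: "completed", audio: { url: "https://example.com/clip.mp3" } };
}

// Drain the generator into an array so the frames can be inspected.
async function collectFrames(): Promise<string[]> {
  const frames: string[] = [];
  for await (const frame of toSseFrames(fakeAudioStream())) {
    frames.push(frame);
  }
  return frames;
}
```

In a real server the frames would be written to a response with `Content-Type: text/event-stream` rather than collected in memory.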
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+---
+'@tanstack/ai-fal': minor
+---
+
+feat: add audio, speech, and transcription adapters to @tanstack/ai-fal
+
+Adds three new tree-shakeable adapters alongside the existing `falImage()` and `falVideo()`:
+
+- `falSpeech()` — text-to-speech via models like Google `fal-ai/gemini-3.1-flash-tts`, `fal-ai/elevenlabs/tts/eleven-v3`, `fal-ai/minimax/speech-2.6-hd`, `fal-ai/kokoro/*`
+- `falTranscription()` — speech-to-text via `fal-ai/whisper`, `fal-ai/wizper`, `fal-ai/speech-to-text/turbo`, `fal-ai/elevenlabs/speech-to-text`
+- `falAudio()` — music and sound-effect generation via `fal-ai/minimax-music/v2.6`, `fal-ai/diffrhythm`, `fal-ai/lyria2`, `fal-ai/stable-audio-25/text-to-audio`, `fal-ai/elevenlabs/sound-effects/v2`
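
The commit message notes that fal audio models name their length field differently: `music_length_ms` (milliseconds) for ElevenLabs Music, `seconds_total` for Stable Audio 2.5, and `duration` for most others. A rough sketch of such a per-model translation is below. The function name `mapDurationSeconds` and the substring-based model matching are assumptions for illustration only, and the `fal-ai/elevenlabs/music` ID in the usage note is a guess rather than a confirmed model ID.

```typescript
// Hypothetical mapping from a generic duration (seconds) to a model-specific input field.
// The substring checks below are assumed matching rules, not the adapter's real logic.
type DurationInput = Record<string, number>;

function mapDurationSeconds(model: string, seconds: number): DurationInput {
  if (model.includes("elevenlabs/music")) {
    // ElevenLabs Music expects milliseconds.
    return { music_length_ms: Math.round(seconds * 1000) };
  }
  if (model.includes("stable-audio-25")) {
    // Stable Audio 2.5 expects seconds under a different field name.
    return { seconds_total: seconds };
  }
  // Most other models accept a plain `duration` in seconds.
  return { duration: seconds };
}
```

For example, `mapDurationSeconds("fal-ai/stable-audio-25/text-to-audio", 30)` yields `{ seconds_total: 30 }`, which could then be merged into the request input.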

.changeset/gemini-audio.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+---
+'@tanstack/ai-gemini': minor
+---
+
+feat(ai-gemini): add Lyria 3 Pro / Clip audio adapter and Gemini 3.1 Flash TTS
+
+**New adapter:**
+
+- `geminiAudio()` for Google Lyria music generation — supports `lyria-3-pro-preview` (full-length songs, MP3/WAV 48 kHz stereo) and `lyria-3-clip-preview` (30-second MP3 clips)
+
+**Enhanced:**
+
+- Added `gemini-3.1-flash-tts-preview` to the TTS model list (70+ languages, 200+ audio tags for expressive control)
+- Added `multiSpeakerVoiceConfig` to `GeminiTTSProviderOptions` for 2-speaker dialogue generation
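
The commit message mentions wrapping raw `audio/L16;codec=pcm` TTS output in a RIFF/WAV container so it is playable. The sketch below shows one way to prepend a standard 44-byte WAV header to PCM bytes. `wrapPcmInWav` is a hypothetical name, and the code assumes the samples are already 16-bit little-endian; big-endian input would need a byte swap first. The sample rate and channel count are supplied by the caller.

```typescript
// Hypothetical helper: prepend a canonical 44-byte RIFF/WAV header to raw 16-bit PCM.
// Assumes little-endian samples; this is a sketch, not the Gemini adapter's code.
function wrapPcmInWav(pcm: Uint8Array, sampleRate: number, channels: number): Uint8Array {
  const bitsPerSample = 16;
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true); // size of the sample data
  out.set(pcm, 44);
  return out;
}
```

The returned bytes can be base64-encoded or wrapped in a `Blob` with type `audio/wav` for playback.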
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+---
+'@tanstack/ai': minor
+'@tanstack/ai-openrouter': patch
+'@tanstack/ai-fal': patch
+'@tanstack/ai-openai': patch
+'@tanstack/ai-gemini': patch
+'@tanstack/ai-grok': patch
+---
+
+Tighten `GeneratedImage` and `GeneratedAudio` to enforce exactly one of `url` or `b64Json` via a mutually-exclusive `GeneratedMediaSource` union.
+
+Both types previously declared `url?` and `b64Json?` as independently optional, which allowed meaningless `{}` values and objects that set both fields. They now require exactly one:
+
+```ts
+type GeneratedMediaSource =
+  | { url: string; b64Json?: never }
+  | { b64Json: string; url?: never }
+```
+
+Existing read patterns like `img.url || \`data:image/png;base64,${img.b64Json}\`` continue to work unchanged. The only runtime-visible change is that the `@tanstack/ai-openrouter` and `@tanstack/ai-fal` image adapters no longer populate `url` with a synthesized `data:image/png;base64,...` URI when the provider returns base64; they return `{ b64Json }` only. Consumers that want a data URI should build it from `b64Json` at render time.
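
The changeset above suggests building a data URI from `b64Json` at render time. A small sketch of such a helper follows, with the union type redeclared locally so the snippet stands alone. `toMediaSrc` and its `mimeType` parameter are hypothetical names, not part of the library.

```typescript
// Local copy of the union described in the changeset, so this sketch is self-contained.
type GeneratedMediaSource =
  | { url: string; b64Json?: never }
  | { b64Json: string; url?: never };

// Hypothetical render-time helper: returns a string usable as an <img> or <audio> src.
function toMediaSrc(media: GeneratedMediaSource, mimeType: string): string {
  if (media.url !== undefined) {
    // Provider returned a hosted URL; use it directly.
    return media.url;
  }
  // Otherwise only base64 is present, so synthesize a data URI.
  return `data:${mimeType};base64,${media.b64Json}`;
}
```

Because the union forbids setting both fields, the helper needs only one branch per case, and callers pass the MIME type they expect (for example `image/png` or `audio/wav`).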

docs/adapters/fal.md

Lines changed: 188 additions & 3 deletions
@@ -13,7 +13,7 @@ keywords:
 - adapter
 ---
 
-The fal.ai adapter provides access to 600+ models on the fal.ai platform for image generation and video generation. Unlike text-focused adapters, the fal adapter is **media-focused** — it supports `generateImage()` and `generateVideo()` but does not support `chat()` or tools. Audio and speech support are coming soon.
+The fal.ai adapter provides access to 600+ models on the fal.ai platform for image, video, audio, speech, and transcription. Unlike text-focused adapters, the fal adapter is **media-focused** — it supports `generateImage()`, `generateVideo()`, `generateAudio()`, `generateSpeech()`, and `generateTranscription()` but does not support `chat()` or tools.
 
 For a full working example, see the [fal.ai example app](https://github.com/TanStack/ai/tree/main/examples/ts-react-media).
 
@@ -209,6 +209,113 @@ const job = await generateVideo({
 });
 ```
 
+## Text-to-Speech
+
+Text-to-speech uses `falSpeech()` with the `generateSpeech()` activity. The adapter fetches the generated audio from fal's CDN and returns it as base64-encoded data to match the `TTSResult` contract.
+
+```typescript
+import { generateSpeech } from "@tanstack/ai";
+import { falSpeech } from "@tanstack/ai-fal";
+
+const result = await generateSpeech({
+  adapter: falSpeech("fal-ai/kokoro/american-english"),
+  text: "Hello from fal!",
+  voice: "af_heart",
+  speed: 1.0,
+});
+
+// result.audio is a base64-encoded string
+console.log(result.format); // e.g. "wav"
+console.log(result.contentType); // e.g. "audio/wav"
+```
+
+### Google Gemini 3.1 Flash TTS
+
+Google's newest TTS model (`fal-ai/gemini-3.1-flash-tts`) supports 80+ languages and introduces **granular audio tags** for expressive control — you can embed speaker tags and style cues directly in the text.
+
+```typescript
+const result = await generateSpeech({
+  adapter: falSpeech("fal-ai/gemini-3.1-flash-tts"),
+  text: "[warm, enthusiastic] Welcome to TanStack AI! [pause] Let's build something great.",
+  voice: "Kore",
+});
+```
+
+> **Note:** This model is newer than `@fal-ai/client@1.9.1`'s type map, so `modelOptions` won't autocomplete. The call still works — the fal adapter accepts any model ID as a string. Type-safe autocomplete will land when fal's SDK types catch up.
+
+### ElevenLabs v3
+
+```typescript
+const result = await generateSpeech({
+  adapter: falSpeech("fal-ai/elevenlabs/tts/eleven-v3"),
+  text: "Welcome to TanStack AI.",
+  modelOptions: {
+    voice: "Rachel",
+    stability: 0.5,
+  },
+});
+```
+
+## Transcription
+
+Speech-to-text uses `falTranscription()` with the `generateTranscription()` activity. The `audio` input accepts a URL string, `Blob`, `File`, or `ArrayBuffer`; an `ArrayBuffer` is automatically wrapped in a `Blob` for upload.
+
+```typescript
+import { generateTranscription } from "@tanstack/ai";
+import { falTranscription } from "@tanstack/ai-fal";
+
+const result = await generateTranscription({
+  adapter: falTranscription("fal-ai/whisper"),
+  audio: "https://example.com/recording.mp3",
+  language: "en",
+});
+
+console.log(result.text);
+console.log(result.language);
+
+// When the model returns word/segment timestamps, they're mapped to result.segments
+for (const segment of result.segments ?? []) {
+  console.log(`[${segment.start}s → ${segment.end}s] ${segment.text}`);
+}
+```
+
+## Audio Generation (Music & Sound Effects)
+
+Music and sound-effect generation uses `falAudio()` with the `generateAudio()` activity. Unlike TTS, the result is returned as a URL in `result.audio.url` (you can fetch it yourself if you need raw bytes).
+
+```typescript
+import { generateAudio } from "@tanstack/ai";
+import { falAudio } from "@tanstack/ai-fal";
+
+// Music generation with MiniMax Music 2.6 (latest)
+const music = await generateAudio({
+  adapter: falAudio("fal-ai/minimax-music/v2.6"),
+  prompt: "City Pop, 80s retro, groovy synth bass, warm female vocal, 104 BPM, nostalgic urban night",
+});
+
+console.log(music.audio.url);
+```
+
+```typescript
+// DiffRhythm with explicit lyrics
+const lyrical = await generateAudio({
+  adapter: falAudio("fal-ai/diffrhythm"),
+  prompt: "An upbeat electronic track with synths",
+  modelOptions: {
+    lyrics: "[verse]\nHello world\n[chorus]\nLa la la",
+  },
+});
+```
+
+```typescript
+// Sound effects
+const sfx = await generateAudio({
+  adapter: falAudio("fal-ai/elevenlabs/sound-effects/v2"),
+  prompt: "Thunderclap with rain",
+  duration: 5,
+});
+```
+
 ## Popular Models
 
 ### Image Models
@@ -233,6 +340,49 @@ const job = await generateVideo({
 | `fal-ai/ltx-2/text-to-video/fast` | Text-to-Video | Fast text-to-video |
 | `fal-ai/ltx-2/image-to-video/fast` | Image-to-Video | Fast image-to-video animation |
 
+### Text-to-Speech Models
+
+| Model | Description |
+|-------|-------------|
+| `fal-ai/gemini-3.1-flash-tts` | **New** — Google's flagship TTS with 80+ languages and expressive audio tags |
+| `fal-ai/elevenlabs/tts/eleven-v3` | ElevenLabs v3 expressive multi-voice TTS |
+| `fal-ai/elevenlabs/tts/turbo-v2.5` | Low-latency ElevenLabs TTS |
+| `fal-ai/minimax/speech-2.6-hd` | MiniMax HD speech synthesis |
+| `fal-ai/minimax/speech-2.6-turbo` | MiniMax low-latency variant |
+| `fal-ai/kokoro/american-english` | Kokoro multilingual TTS — also `british-english`, `french`, `spanish`, `italian`, `japanese`, `mandarin-chinese`, `hindi`, `brazilian-portuguese` |
+| `fal-ai/inworld-tts` | Inworld TTS-1.5 Max |
+| `fal-ai/chatterbox/text-to-speech/multilingual` | Chatterbox multilingual TTS |
+| `fal-ai/dia-tts` | Dia expressive dialogue TTS |
+| `fal-ai/orpheus-tts` | Orpheus open-source TTS |
+| `fal-ai/f5-tts` | F5-TTS voice cloning |
+| `fal-ai/vibevoice/7b` | VibeVoice 7B conversational TTS |
+
+### Transcription Models
+
+| Model | Description |
+|-------|-------------|
+| `fal-ai/whisper` | OpenAI Whisper on fal infra |
+| `fal-ai/wizper` | Faster-whisper variant with word-level timestamps |
+| `fal-ai/speech-to-text/turbo` | Turbo STT with diarization |
+| `fal-ai/elevenlabs/speech-to-text` | ElevenLabs STT |
+
+### Audio / Music Models
+
+| Model | Mode | Description |
+|-------|------|-------------|
+| `fal-ai/minimax-music/v2.6` | Music | **New** — MiniMax Music 2.6, full vocal + instrumental compositions from a prompt |
+| `fal-ai/minimax-music/v2.5` | Music | MiniMax Music 2.5 |
+| `fal-ai/minimax-music/v2` | Music | MiniMax Music v2 — supports `lyrics_prompt` |
+| `fal-ai/diffrhythm` | Music | DiffRhythm — prompt + lyrics |
+| `fal-ai/lyria2` | Music | Google Lyria 2 high-fidelity music |
+| `fal-ai/stable-audio-25/text-to-audio` | Music / Audio | Stability AI Stable Audio 2.5 |
+| `fal-ai/mmaudio-v2/text-to-audio` | Audio | MMAudio v2 text-to-audio |
+| `fal-ai/elevenlabs/sound-effects/v2` | SFX | ElevenLabs sound-effect generation |
+| `fal-ai/beatoven/sound-effect-generation` | SFX | Beatoven professional sound effects |
+| `fal-ai/thinksound` | Audio | Thinksound reasoning-based audio generation |
+
+> **Very new models** (e.g. `gemini-3.1-flash-tts`, `minimax-music/v2.6`, `beatoven/sound-effect-generation`) may not yet appear in `@fal-ai/client`'s type map — they still work as plain string model IDs, you just won't get autocomplete for their `modelOptions`.
+
 ## Environment Variables
 
 Create an API key at [fal.ai](https://fal.ai) and set it in your environment:
@@ -275,6 +425,42 @@ Creates a fal.ai video adapter using the `FAL_KEY` environment variable or an ex
 
 Alias for `falVideo()`.
 
+### `falSpeech(model, config?)`
+
+Creates a fal.ai text-to-speech adapter.
+
+**Parameters:**
+
+- `model` - The fal.ai TTS model ID (e.g., `"fal-ai/kokoro/american-english"`)
+- `config.apiKey?` - Your fal.ai API key (falls back to `FAL_KEY` env var)
+- `config.proxyUrl?` - Proxy URL for client-side usage
+
+**Returns:** A `FalSpeechAdapter` instance for use with `generateSpeech()`. The adapter fetches the generated audio URL from fal and returns it as base64 in `result.audio`.
+
+### `falTranscription(model, config?)`
+
+Creates a fal.ai transcription (speech-to-text) adapter.
+
+**Parameters:**
+
+- `model` - The fal.ai STT model ID (e.g., `"fal-ai/whisper"`)
+- `config.apiKey?` - Your fal.ai API key (falls back to `FAL_KEY` env var)
+- `config.proxyUrl?` - Proxy URL for client-side usage
+
+**Returns:** A `FalTranscriptionAdapter` instance for use with `generateTranscription()`.
+
+### `falAudio(model, config?)`
+
+Creates a fal.ai audio generation adapter (music and sound effects).
+
+**Parameters:**
+
+- `model` - The fal.ai audio model ID (e.g., `"fal-ai/diffrhythm"`, `"fal-ai/minimax-music/v2"`)
+- `config.apiKey?` - Your fal.ai API key (falls back to `FAL_KEY` env var)
+- `config.proxyUrl?` - Proxy URL for client-side usage
+
+**Returns:** A `FalAudioAdapter` instance for use with `generateAudio()`. The result contains a URL at `result.audio.url`.
+
 ### `getFalApiKeyFromEnv()`
 
 Reads the `FAL_KEY` environment variable. Throws if not set.
@@ -288,9 +474,8 @@ Configures the underlying `@fal-ai/client`. Called automatically by adapter cons
 ## Limitations
 
 - **No text/chat support** — Use OpenAI, Anthropic, Gemini, or another text adapter for `chat()`
-- **No tools support** — Tool definitions are not applicable to image/video generation
+- **No tools support** — Tool definitions are not applicable to media generation
 - **No summarization** — Use a text adapter for `summarize()`
-- **No TTS or transcription yet** — Audio and speech support are coming soon
 - **Video is experimental** — The video generation API may change in future releases
 
 ## Next Steps
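
The commit message says the fal extension helper now parses via the URL API, strips fragments and trailing slashes, and inspects only the final path segment so a TLD like `.com` is not mistaken for an extension. The following is an independent sketch of that behavior, not the adapter's source. The name `urlExtensionSketch` and the lowercase normalization are assumptions.

```typescript
// Hypothetical re-creation of the described behavior; not copied from @tanstack/ai-fal.
function urlExtensionSketch(rawUrl: string): string | undefined {
  let pathname: string;
  try {
    // URL.pathname excludes the host, the query string, and the fragment.
    pathname = new URL(rawUrl).pathname;
  } catch {
    return undefined; // not a parseable absolute URL
  }
  // Dropping empty segments also ignores trailing slashes.
  const segments = pathname.split("/").filter((segment) => segment.length > 0);
  const last = segments[segments.length - 1];
  if (last === undefined) return undefined; // bare host such as https://example.com
  const dot = last.lastIndexOf(".");
  // Reject dotfiles (".env") and names ending in a dot ("clip.").
  if (dot <= 0 || dot === last.length - 1) return undefined;
  return last.slice(dot + 1).toLowerCase();
}
```

Because only the path is examined, `https://example.com` produces no extension, while a CDN URL ending in `clip.mp3?sig=1` produces `mp3`.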

docs/advanced/debug-logging.md

Lines changed: 5 additions & 1 deletion
@@ -128,7 +128,7 @@ const logger: Logger = {
 | Category | Logs | Applies to |
 |----------|------|------------|
 | `request` | Outgoing call to a provider (model, message count, tool count) | All activities |
-| `provider` | Every raw chunk/frame received from a provider SDK | Streaming activities (chat, realtime) |
+| `provider` | Every raw chunk/frame received from a provider SDK | Streaming activities (`chat`, `realtime`, and streaming `generateAudio`/`generateSpeech`/`generateTranscription`) |
 | `output` | Every chunk or result yielded to the caller | All activities |
 | `middleware` | Inputs and outputs around every middleware hook | `chat()` only |
 | `tools` | Before/after tool call execution | `chat()` only |
@@ -154,8 +154,12 @@ The same `debug` option works on every activity:
 summarize({ adapter, text, debug: true });
 generateImage({ adapter, prompt: "a cat", debug: { logger } });
 generateSpeech({ adapter, text, debug: { request: true } });
+generateAudio({ adapter, prompt: "ambient piano", debug: true });
+generateTranscription({ adapter, audio, debug: { provider: true } });
 ```
 
+When streaming any of these (`generateAudio`, `generateSpeech`, `generateTranscription` with `stream: true`), the `provider` category emits the raw SDK chunks and `output` emits the AG-UI-shaped chunks yielded to the caller — useful when a media pipeline looks stuck or the bytes arriving don't match what you expected.
+
 The chat-only categories (`middleware`, `tools`, `agentLoop`, `config`) simply never fire for these activities because those concepts don't exist in their pipelines.
 
 ## Related

docs/config.json

Lines changed: 4 additions & 0 deletions
@@ -146,6 +146,10 @@
   "label": "Transcription",
   "to": "media/transcription"
 },
+{
+  "label": "Audio Generation",
+  "to": "media/audio-generation"
+},
 {
   "label": "Image Generation",
   "to": "media/image-generation"
