
Commit 8c66c88: Add audio chat (#32)
Parent: 39d0339
6 files changed: 444 additions & 92 deletions


README.md

Lines changed: 45 additions & 0 deletions
@@ -20,6 +20,7 @@ ONNX export and inference tools for [LFM2](https://www.liquid.ai/liquid-foundati
 | **LFM2.5**, **LFM2** | fp32, fp16, q4, q8 |
 | **LFM2.5-VL**, **LFM2-VL** | fp32, fp16, q4, q8 |
 | **LFM2-MoE** | fp32, fp16, q4, q4f16 |
+| **LFM2.5-Audio** | fp32, fp16, q4, q8 |
 
 
 ## 2. Installation
@@ -110,6 +111,47 @@ uv run lfm2-moe-infer --model ./exports/LFM2-MoE-8B-A1B-ONNX/onnx/model_q4.onnx
 uv run lfm2-moe-infer --model ./exports/LFM2-MoE-8B-A1B-ONNX/onnx/model_q4.onnx --cpu
 ```
 
+### 4.4 Audio (ASR, TTS, Interleaved)
+
+LFM2.5-Audio is a multimodal audio-language model supporting three modes:
+- **ASR** (Automatic Speech Recognition): Transcribe audio to text
+- **TTS** (Text-to-Speech): Generate audio from text
+- **Interleaved**: Mixed text and audio input/output for conversational audio
+
+The model uses 5 ONNX components:
+- `decoder.onnx` - LFM2 language model backbone
+- `audio_encoder.onnx` - Conformer encoder for ASR input
+- `audio_embedding.onnx` - Audio code embeddings for TTS/interleaved
+- `audio_detokenizer.onnx` - Converts audio codes to STFT features
+- `vocoder_depthformer.onnx` - Autoregressive audio codebook prediction
+
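The five components above can be located on disk with a small path helper. A minimal sketch, assuming a `name_{precision}.onnx` suffix convention under an `onnx/` subdirectory, mirroring the text models' `model_q4.onnx` naming; the helper and the exact file layout are illustrative assumptions, not part of the repository:

```python
from pathlib import Path

# The five ONNX components of LFM2.5-Audio (from the list above).
COMPONENTS = [
    "decoder",             # LFM2 language model backbone
    "audio_encoder",       # Conformer encoder for ASR input
    "audio_embedding",     # audio code embeddings for TTS/interleaved
    "audio_detokenizer",   # audio codes -> STFT features
    "vocoder_depthformer", # autoregressive audio codebook prediction
]

def component_paths(model_dir: str, precision: str = "q4") -> dict[str, Path]:
    """Map each component name to its expected .onnx path.

    Assumes fp32 files carry no suffix and quantized files carry
    `_{precision}`; hypothetical, the real layout may differ.
    """
    root = Path(model_dir) / "onnx"
    suffix = "" if precision == "fp32" else f"_{precision}"
    return {name: root / f"{name}{suffix}.onnx" for name in COMPONENTS}

paths = component_paths("exports/LFM2.5-Audio-1.5B-ONNX", "q4")
print(paths["decoder"])  # exports/LFM2.5-Audio-1.5B-ONNX/onnx/decoder_q4.onnx on POSIX
```

Each path would then be handed to a separate `onnxruntime.InferenceSession`, which is why the CLI takes the model directory rather than a single file.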
+```bash
+# ASR: Transcribe audio to text
+uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
+  --audio input.wav --precision q4
+
+# TTS: Generate speech from text
+uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
+  --prompt "Hello, how are you today?" \
+  --system "Perform TTS. Use the UK female voice." \
+  --output output.wav --precision q4
+
+# Interleaved: Audio input with text+audio response
+uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode interleaved \
+  --audio question.wav --output response.wav --precision q4
+
+# Interactive chat mode (multi-turn with stateful KV cache)
+uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode interleaved --chat \
+  --output output.wav --precision q4
+# Commands in chat mode:
+#   /audio <file> [text] - Send audio with optional text
+#   <text>               - Send text message
+#   reset                - Clear conversation state
+#   quit                 - Exit
+```
+
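The chat-mode command grammar above is simple enough to sketch as a small dispatcher. A minimal illustration; `parse_chat_command` and its return shape are hypothetical helpers, not part of the actual CLI:

```python
def parse_chat_command(line: str) -> tuple[str, dict]:
    """Classify one chat-mode input line per the command list above.

    Returns (kind, args), where kind is one of "audio", "text",
    "reset", or "quit". Hypothetical helper for illustration only.
    """
    line = line.strip()
    if line == "quit":
        return "quit", {}
    if line == "reset":
        return "reset", {}
    if line.startswith("/audio"):
        # /audio <file> [text]: the file is required, trailing text optional
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            raise ValueError("usage: /audio <file> [text]")
        return "audio", {"file": parts[1],
                         "text": parts[2] if len(parts) == 3 else ""}
    # Anything else is a plain text message.
    return "text", {"text": line}

print(parse_chat_command("/audio q.wav what is this?"))
# ('audio', {'file': 'q.wav', 'text': 'what is this?'})
```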
+> **Note:** Audio inference requires the model directory path (not a single .onnx file), since it loads multiple components. Use `--precision` to select the quantization level (fp16, q4, q8).
+
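To try the ASR command without a recording handy, a placeholder `input.wav` can be synthesized with the Python standard library. A minimal sketch; the 16 kHz mono 16-bit PCM format is an assumption about what the audio encoder expects, not a documented requirement:

```python
import math
import struct
import wave

# Write one second of a 440 Hz sine tone as 16-bit mono PCM.
# 16 kHz mono is an assumed input format; adjust if the encoder differs.
SAMPLE_RATE = 16_000

with wave.open("input.wav", "wb") as wf:
    wf.setnchannels(1)            # mono
    wf.setsampwidth(2)            # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    frames = bytearray()
    for n in range(SAMPLE_RATE):  # one second of audio
        sample = int(20_000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    wf.writeframes(bytes(frames))
```

The resulting file can then be passed as `--audio input.wav` in the ASR example above (transcribing a pure tone will of course yield little useful text; it only exercises the pipeline).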
 ## 5. Testing
 
 Tests verify ONNX exports against PyTorch reference models.
@@ -149,6 +191,9 @@ uv run lfm2-bench --model LiquidAI/LFM2.5-1.2B-Instruct \
 **Vision-Language:**
 - [LiquidAI/LFM2.5-VL-1.6B-ONNX](https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-ONNX)
 
+**Audio:**
+- [LiquidAI/LFM2.5-Audio-1.5B-ONNX](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B-ONNX)
+
 ### 6.2 onnx-community
 
 **Text models:**

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ lfm2-vl-infer = "liquidonnx.lfm2_vl.infer:main"
 lfm2-moe-export = "liquidonnx.lfm2_moe.export:main"
 lfm2-moe-infer = "liquidonnx.lfm2_moe.infer:main"
 
-# LFM2.5-Audio model tools
+# LFM2.5-Audio multimodal audio model tools
 lfm2-audio-export = "liquidonnx.lfm2_audio.export:main"
 lfm2-audio-infer = "liquidonnx.lfm2_audio.infer:main"
 

scripts/tts_onnx_fp16.sh

Lines changed: 4 additions & 0 deletions
@@ -2,8 +2,12 @@
 set -e
 set -x
 mkdir -p output
+
+SYSTEM_PROMPT="Perform TTS. Use the UK female voice."
+
 uv run lfm2-audio-infer exports/LFM2.5-Audio-1.5B-ONNX \
   --mode tts \
   --precision fp16 \
   --prompt "Don't ask what you can do for your country. Ask what your country can do for you." \
+  --system "$SYSTEM_PROMPT" \
   --output output/tts_onnx_fp16.wav

scripts/tts_onnx_q4.sh

Lines changed: 4 additions & 0 deletions
@@ -2,8 +2,12 @@
 set -e
 set -x
 mkdir -p output
+
+SYSTEM_PROMPT="Perform TTS. Use the UK female voice."
+
 uv run lfm2-audio-infer exports/LFM2.5-Audio-1.5B-ONNX \
   --mode tts \
   --precision q4 \
   --prompt "Don't ask what you can do for your country. Ask what your country can do for you." \
+  --system "$SYSTEM_PROMPT" \
   --output output/tts_onnx_q4.wav

scripts/tts_onnx_q8.sh

Lines changed: 4 additions & 0 deletions
@@ -2,8 +2,12 @@
 set -e
 set -x
 mkdir -p output
+
+SYSTEM_PROMPT="Perform TTS. Use the UK female voice."
+
 uv run lfm2-audio-infer exports/LFM2.5-Audio-1.5B-ONNX \
   --mode tts \
   --precision q8 \
   --prompt "Don't ask what you can do for your country. Ask what your country can do for you." \
+  --system "$SYSTEM_PROMPT" \
   --output output/tts_onnx_q8.wav
