Releases: Blaizzy/mlx-audio
v0.4.0
What's Changed
- add example of qwen3-asr with forced alignment by @eschmidbauer in #463
- Restore Qwen3-TTS encoder_config to preserve accents in voice clones. by @orbitalquark in #461
- Ensure TTS audio player plays a trailing, partially-filled audio frame. by @orbitalquark in #465
- Fix source separation issue with shape mismatch: noise shape for separate_long by @mnoukhov in #467
- Enable streaming for Qwen3-TTS when ICL mode is enabled. by @orbitalquark in #466
- feat(stt): add support for MedASR (Lasr architecture) by @sigjhl in #376
- Formatting fix and add to pyproject.toml by @mnoukhov in #475
- Use shared model cache resolution for SAM-Audio by @mnoukhov in #468
- Fix voice matching for Pocket TTS by @lucasnewman in #477
- Fix ALMs max tokens and chunking by @Blaizzy in #474
- [Soprano] Fix decoder and config loading (v1 and v1.1) by @Blaizzy in #480
- Add audio separation UI & Server by @Blaizzy in #347
- Add Parakeet v3 multilingual support with language detection by @andimarafioti in #481
- Revert "Do not discard the last unfilled audio frame. (#465)" by @orbitalquark in #473
- Fix longform generation for Pocket TTS by @lucasnewman in #486
- Enable streaming /v1/audio/speech server endpoint with raw/pcm data. by @orbitalquark in #484
- feat: Add support for Voxtral Mini 4B Realtime by @shreyaskarnik in #487
- fix(kokoro): Chinese TTS crashes with ValueError in g2p pipeline by @smartchainark in #489
- fix: update STT transcription parameters and preserve original audio format by @shreyaskarnik in #488
- fix(docs): update outdated model links by @joaopalmeiro in #495
- [VAD / Diarization] Add sortformer by @Blaizzy in #493
- feat: add streaming support and toggle for realtime STT by @Cold-A-Muse in #494
- chore: update protobuf dependency to version 6.33.5 by @Blaizzy in #497
- fix(ui): update footer to display the current year dynamically and link to the repo by @shreyaskarnik in #509
- Add Smart Turn v3 semantic VAD by @lucasnewman in #511
- fix(vibevoice-asr): add audio resampling and normalization to preprocessing by @bellkjtt in #510
- Refactor(whisper): update model instantiation and loading process by @Blaizzy in #514
- fix: replace 4 bare excepts with except Exception by @haosenwang1018 in #521
- feat(stt): add system_prompt parameter to Qwen3ASR generation methods by @chris-schra in #522
- fix(medasr): collapse CTC tokens manually to prevent raw output by @sigjhl in #519
- Add Echo TTS by @lucasnewman in #525
- Allow printing transcriptions to stdout when output path is "-". by @orbitalquark in #527
- stt: Keep uploaded file extension to avoid unnecessary conversions. by @orbitalquark in #528
- feat(lid): Add spoken language identification (MMS-LID) by @beshkenadze in #529
- Set max tokens to a more reasonable value by default for STT by @lucasnewman in #533
- [Qwen3-TTS] Improve inference, TTFB and add batch support by @Blaizzy in #534
- refactor(codec): extract shared ECAPA-TDNN backbone by @beshkenadze in #532
- Add KittenTTS support and ONNX parity fixes by @Reza2kn in #517
- Fix Qwen3-TTS streaming decoder throttling with incremental decoding by @Blaizzy in #537
- Add ming omni tts (MoE and Dense) by @Blaizzy in #515
- [Qwen3-ASR] Fix auto lang detection by @Blaizzy in #547
- Add nvfp4, mxfp4 and mxfp8 quants by @Blaizzy in #543
- Fix duplicate audio_samples field in GenerationResult dataclass by @Blaizzy in #548
- [Whisper] Fix lang code assignment by @Blaizzy in #549
New Contributors
- @eschmidbauer made their first contribution in #463
- @orbitalquark made their first contribution in #461
- @mnoukhov made their first contribution in #467
- @sigjhl made their first contribution in #376
- @andimarafioti made their first contribution in #481
- @shreyaskarnik made their first contribution in #487
- @smartchainark made their first contribution in #489
- @joaopalmeiro made their first contribution in #495
- @Cold-A-Muse made their first contribution in #494
- @bellkjtt made their first contribution in #510
- @haosenwang1018 made their first contribution in #521
- @chris-schra made their first contribution in #522
- @Reza2kn made their first contribution in #517
Full Changelog: v0.3.1...v0.4.0
v0.3.1
What's Changed
- Update uv.lock to reflect dependency version changes by @Blaizzy in #432
- v0.3.1: Update STT API docs and fix default output path by @Blaizzy in #433
- Qwen3-TTS: Add streaming and optimise peak usage by @Blaizzy in #435
- Fix: Use single quotes in README examples to avoid Bash history expansion. by @reinexworldc in #440
- Fix: improve import error handling by @reinexworldc in #443
- [Qwen3-TTS] Fix some Custom Voices producing silence with 0.6B by @Blaizzy in #444
- Refactor audio load by @Blaizzy in #445
- Update pyproject.toml for poetry support by @lucasnewman in #446
- Add Qwen3-ASR by @Blaizzy in #454
- Fix chatterbox load by @Blaizzy in #455
- Update README to remove basic usage section by @Blaizzy in #456
- Update README with output path for ASR commands by @rahimnathwani in #458
- Update package dependencies in uv.lock to include new extras by @Blaizzy in #457
- Fix server (STT, TTS) by @Blaizzy in #460
New Contributors
- @reinexworldc made their first contribution in #440
Full Changelog: v0.3.0...v0.3.1
v0.3.0
What's Changed
- Fix speaker embedding extraction in Qwen3-TTS model by @Blaizzy in #390
- Fix Qwen3-TTS tail artifacts by @Blaizzy in #391
- Fix Qwen3-TTS Base Voice Cloning by @Blaizzy in #394
- Add Vibevoice ASR by @Blaizzy in #389
- Qwen3 speaker embedding tests by @Blaizzy in #396
- Update TTS commands in README to include language code option by @rudolfolah in #401
- Unify Mimi implementation for Pocket TTS by @lucasnewman in #403
- Fix issue of ref_audio not loading prior to inference with server. by @BuffMcBigHuge in #406
- Enhance README with installation and usage examples by @rahimnathwani in #404
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #418
- Upgrade GitHub Actions to latest versions by @salmanmkc in #419
- [VibeVoice-ASR] Fix Metal kernel crash and optimize memory for long audio by @Blaizzy in #417
- fix: Allow quantization of Qwen3-TTS by adding model_quant_predicate to exclude embedding layers by @kyr0 in #398
- Fix qwen3 tts quants (silence in VC and word precision) by @Blaizzy in #407
- Fix stt array io by @Blaizzy in #426
- Update MANIFEST.in to remove leading dot from requirements.txt path by @Blaizzy in #428
- Move audio path/format prints under verbose flag by @wladpaiva in #429
- Update pyproject.toml and GitHub Actions workflow for package publishing by @Blaizzy in #431
New Contributors
- @rudolfolah made their first contribution in #401
- @BuffMcBigHuge made their first contribution in #406
- @rahimnathwani made their first contribution in #404
- @salmanmkc made their first contribution in #418
- @kyr0 made their first contribution in #398
- @wladpaiva made their first contribution in #429
Full Changelog: v0.2.10...v0.3.0
v0.3.0rc1
What's Changed
- Remove extra deps by @Blaizzy in #373
- Refactor load by @Blaizzy in #374
- Add lfm2 audio by @Blaizzy in #370
- update lfm readme by @Blaizzy in #377
- Fix lang codes kokoro by @Blaizzy in #380
- Replace soundfile with miniaudio + ffmpeg by @Blaizzy in #379
- Add Pocket TTS model by @lucasnewman in #381
- Fix STT stream by @Blaizzy in #382
- Migrate swift to https://github.com/Blaizzy/mlx-audio-swift by @Blaizzy in #363
- Refactor model path retrieval in get_model_path function by @Blaizzy in #383
- Add streaming decoding to snac and orpheus by @Blaizzy in #384
- Update generate output path by @Blaizzy in #385
- Add Qwen3-TTS by @Blaizzy in #388
Full Changelog: v0.2.10...v0.3.0rc1
v0.2.10
What's Changed
- Refactor GLMASR and improve LM style ASR logging by @Blaizzy in #332
- Remove actual issue ID reference from PR template by @mootari in #334
- Add maya1 fixes to Llama by @Blaizzy in #340
- Fix marvis, chatterbox and args by @Blaizzy in #342
- Add Sam Audio by @Blaizzy in #338
- fix: Add missing mlx-lm dependency by @joshwhiton in #344
- feat(swift): add Kokoro-82M-v1.1-zh MLX Support by @Alex-Wengg in #341
- Remove loguru reconfiguration on Kokoro import by @joshwhiton in #348
- feat(stt): add AlignAtt streaming transcription for Whisper by @beshkenadze in #321
- Fix stft args by @Blaizzy in #354
- Allow using DACVAE as a codec independent of SAM Audio model by @lucasnewman in #357
- chore: update Python version requirement and dependencies by @Blaizzy in #355
- Add MossFormer2 SE (Speech Enhancement) by @starkdmi in #351
- use chatterbox MTLTokenizer for multilingual. by @litmudoc in #362
- Add streaming and refactor Sam Audio API by @Blaizzy in #360
- Add Soprano by @Blaizzy in #359
- Fix model type, refactor orpheus style models by @Blaizzy in #358
- revert default response format to mp3 by @Blaizzy in #356
- Refactor voice loading in KokoroPipeline to support .safetensors files by @Blaizzy in #364
- Add uv.lock and pin all deps as core by @Blaizzy in #366
New Contributors
- @mootari made their first contribution in #334
- @Alex-Wengg made their first contribution in #341
- @starkdmi made their first contribution in #351
- @litmudoc made their first contribution in #362
Full Changelog: v0.2.9...v0.2.10
v0.2.9
What's Changed
- Add GLM ASR by @Blaizzy in #320
- Simplify convert API for TTS and STT by @Blaizzy in #324
- [Chatterbox-turbo] Add speaker embedding by @Blaizzy in #322
- [Chatterbox-turbo] Add in-place cache by @Blaizzy in #322
- [Chatterbox-turbo] Add audio streaming by @Blaizzy in #322
- [Chatterbox-turbo] Add audio chunking by @Blaizzy in #322
Full Changelog: v0.2.6...v0.2.9
v0.2.8
What's Changed
- fix(server): use lowercase for default response_format by @beshkenadze in #301
- Add Chatterbox and Chatterbox Turbo by @Blaizzy in #302
- Add Chatterbox [VC only] by @DePasqualeOrg in #282
- feat: add lazy imports for TTS/STT modules by @beshkenadze in #290
- Pin transformers dep <5.0.0 by @Blaizzy in #303
- feat: migrate from setup.py to pyproject.toml with optional deps by @beshkenadze in #291
- fix(test): use case-insensitive content-type comparison by @beshkenadze in #300
- ci: add modular installation tests for pyproject.toml extras by @beshkenadze in #298
- Fix build by @Blaizzy in #304
Full Changelog: v0.2.7...v0.2.8
v0.2.7
What's Changed
- Refactor Marvis TTS API: Make public methods accessible by @rudrankriyam in #259
- Add Marvis model selection to TTS UI by @rudrankriyam in #261
- Add Marvis quant selection to TTS Web UI by @adrgrondin in #264
- Fix security vulnerabilities in Next.js and brace-expansion dependencies by @Copilot in #265
- make the methods more useful by @pritamsoni-hsr in #246
- Update to mlx-swift-lm and remove redundant mlx-swift dependency by @rudrankriyam in #269
- Fix voxtral segments by @Blaizzy in #273
- Add UI startup option to Server by @Blaizzy in #274
- Add preemphasis preprocessing support for Parakeet models to match NeMo training config by @joshwhiton in #286
- feat: Add support for VoxCPM (w/ voice cloning) by @voxmenthe in #293
- Fix spark decoding by @Blaizzy in #296
- feat: extract DSP utilities to dedicated module by @beshkenadze in #289
- Feat: add response format option to SpeechRequest by @Blaizzy in #297
- Add VibeVoice by @Blaizzy in #295
New Contributors
- @pritamsoni-hsr made their first contribution in #246
- @joshwhiton made their first contribution in #286
- @voxmenthe made their first contribution in #293
- @beshkenadze made their first contribution in #289
Full Changelog: v0.2.6...v0.2.7
v0.2.6
What's Changed
- fix wav2vec by @josharian in #222
- Fix RTF calculation in kokoro model by @davidxifeng in #227
- Fix Unnecessary Audio Transcription for the IndexTTS Model by @bytefer in #231
- Add Sesame TTS Integration for Swift Audio Package by @rudrankriyam in #223
- Use batched vocoding to reduce peak memory usage with Sesame arch models by @lucasnewman in #236
- Cache RoPE by dtype for Sesame arch models for improved generation performance by @lucasnewman in #232
- Install Metal toolchain for Swift tests by @lucasnewman in #233
- Adopt interface changes from mlx-lm to fix Sesame-arch models by @lucasnewman in #242
- Update swift-transformers dependency to 1.1.0 by @Liam1506 in #247
- Improve Swift TTS app UX by @rudrankriyam in #248
- Add quality selection and streaming controls to Marvis with UI support for macOS & iOS by @rudrankriyam in #249
- Fix Swift compiler warnings by @rudrankriyam in #250
- Refactor MarvisModel to handle optional backbone and decoder flavors by @rudrankriyam in #251
- Fix iOS 16 compatibility and ESpeakNG framework linking for iOS app by @rudrankriyam in #252
- Add memory increase limit for iOS by @rudrankriyam in #253
- Update audio playback management in Marvis TTS by @rudrankriyam in #254
- Bump version and add new copy files by @Blaizzy in #255
- Add UI v2 by @Blaizzy in #154
New Contributors
- @josharian made their first contribution in #222
- @davidxifeng made their first contribution in #227
- @bytefer made their first contribution in #231
- @Liam1506 made their first contribution in #247
Full Changelog: v0.2.5...v0.2.6
v0.2.5
What's Changed
- Use indeterminate progress for CSM models by @lucasnewman in #216
- Bump version to 0.2.5 by @Blaizzy in #219
Full Changelog: v0.2.4...v0.2.5