[WIP][TSS][Model] Kimi-Audio-7B by zhangj1an · Pull Request #2941 · vllm-project/vllm-omni

zhangj1an · 2026-04-20T09:18:20Z


Latest Status [27 Apr]

Benchmarks are still running.

Purpose

Closes #1824.

Kimi-Audio consists of three main components:

- Audio Tokenizer: converts input audio into:
  - Discrete semantic tokens (12.5 Hz) using vector quantization.
  - Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).
- Audio LLM: a transformer-based model (initialized from a pre-trained text LLM such as Qwen 2.5 7B) with shared layers that process multimodal inputs, followed by parallel heads that autoregressively generate text tokens and discrete audio semantic tokens.
- Audio Detokenizer: converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), and supports chunk-wise streaming with a look-ahead mechanism for low latency.
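To make the 12.5 Hz token rate and the chunk-wise look-ahead idea concrete, here is a minimal sketch. It is not the code in this PR: the helper names `num_semantic_tokens` and `lookahead_chunks`, and the chunk and look-ahead sizes, are hypothetical and chosen only for illustration. The assumption is that each chunk is decoded with a few future tokens as extra context, while only the chunk's own tokens are rendered to audio.

```python
import math
from typing import Iterator, List, Tuple

# Token rate stated in the PR description; everything else here is illustrative.
TOKEN_RATE_HZ = 12.5


def num_semantic_tokens(duration_s: float) -> int:
    """Number of 12.5 Hz semantic tokens needed to cover `duration_s` seconds."""
    return math.ceil(duration_s * TOKEN_RATE_HZ)


def lookahead_chunks(
    tokens: List[int], chunk_size: int, lookahead: int
) -> Iterator[Tuple[List[int], int]]:
    """Yield (window, n_emit) pairs for chunk-wise decoding.

    `window` holds the chunk plus up to `lookahead` future tokens as context;
    only the first `n_emit` tokens of the window would be turned into audio.
    (Hypothetical scheme; the real detokenizer may window differently.)
    """
    for start in range(0, len(tokens), chunk_size):
        end = min(start + chunk_size, len(tokens))
        window = tokens[start : min(end + lookahead, len(tokens))]
        yield window, end - start


if __name__ == "__main__":
    print(num_semantic_tokens(10.0))  # 10 s of audio at 12.5 Hz
    for window, n_emit in lookahead_chunks(list(range(10)), chunk_size=4, lookahead=2):
        print(window, n_emit)
```

A smaller chunk size lowers the time to first audio, while a larger look-ahead gives the decoder more future context at the cost of extra latency per chunk.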

Supports three tasks:

- ASR (audio → text)
- Audio-to-audio chat (audio in → audio + text out)
- Multi-turn audio conversation

Key Adaptations

Transformer: reuses the Qwen2 decoder.
Vocoder: keeps the original Kimi BigVGAN implementation, because using the Qwen2 BigVGAN vocoder degrades audio generation quality.

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts and test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

hsliuustc0106 · 2026-04-20T09:21:39Z

Ready for full review when draft status removed. Preliminary scan available on request.


zhangj1an added 2 commits April 18, 2026 17:06

add kimi audio part 1/3

fcc3e28

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

vibe coded needed components. will clean up now

ddb39b8

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

zhangj1an added 26 commits April 20, 2026 09:36

Merge remote-tracking branch 'origin/main' into jian/kimiaudio

e41aabd

underwent a skills review

3dc6693

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

merge main

6b396ea

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

merge example file

5808ce0

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

add example url

8dacad2

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

remove dependency on kimia_infer

8c224ba

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

remove upstream bigvgan

eaafa3a

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

remove scheduler

3fa43a5

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

use existing rope

92a64f7

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

refactor wrapper

337ed3e

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

inline rope

6761bb4

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

remove comments

060c67d

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

remove wrapper

a8bd5c7

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

simplify config

a7bab84

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

merge main

1458bdc

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

refactor layout

62d7ea0

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

refactor format

7af501a

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

update yaml format

68d21f7

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

remove redundant code

a90520c

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

precommit check

230f915

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

remove dead code, but output has noise

ff845c5

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

add dynamic cache

c47b84f

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

voice is better

8f68f6c

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

qwen2.5bigvan does not work, switch to kimi bigvan

99f3dcf

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

merge main

d40a5b3

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

asr can run now

d049a8a

Signed-off-by: Zhang Jian <jianmusings@gmail.com>

zhangj1an wants to merge 28 commits into vllm-project:main from zhangj1an:jian/kimiaudio
