Description
We use an audio chunker to run Whisper inference in a streaming manner.
hyprnote/plugins/local-stt/src/server.rs
Lines 148 to 150 in 573d5ef
A good chunker is important. At a high level, each chunk should:
- Be at most 30 sec long (a Whisper constraint); in practice, around 12 sec or so.
- Be split on silence, with as much silence stripped as possible (Filter out silences #662). Whisper tends to hallucinate a lot on empty audio.
Our current approach:
chunker/stream.rs works with a pluggable predictor.
Currently we use a very simple RMS-based predictor:
hyprnote/crates/chunker/src/predictor.rs
Lines 14 to 25 in 573d5ef
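For readers without the permalink at hand, here is a minimal sketch of what an RMS-based predictor can look like. It is not the code in predictor.rs: the `Predictor` trait, the `RmsPredictor` struct, and the `threshold` field are hypothetical names chosen for illustration, and the threshold value is an assumption.

```rust
/// Hypothetical trait for illustration; the real trait in
/// chunker/src/predictor.rs may have a different shape.
pub trait Predictor {
    /// Returns true if the frame likely contains speech.
    fn predict(&self, samples: &[f32]) -> bool;
}

/// Energy-based predictor: a frame counts as speech when its
/// root-mean-square amplitude exceeds `threshold`.
pub struct RmsPredictor {
    /// Assumed tuning knob; the right value depends on the input gain.
    pub threshold: f32,
}

impl Predictor for RmsPredictor {
    fn predict(&self, samples: &[f32]) -> bool {
        // An empty frame carries no energy, so treat it as silence.
        if samples.is_empty() {
            return false;
        }
        let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
        let rms = (sum_sq / samples.len() as f32).sqrt();
        rms > self.threshold
    }
}
```

A predictor like this is cheap, but it cannot tell speech from any other loud sound, which is one reason to consider a model-based VAD.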
Max-length constraint:
hyprnote/crates/chunker/src/stream.rs
Line 70 in 573d5ef
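The requirements above (split on silence, strip silence, hard cap on length) can be sketched as a small batch function. This is an illustration under stated assumptions, not the streaming logic in stream.rs: `chunk_samples`, `frame_len`, and `max_chunk_len` are hypothetical names, and the predictor is passed in as a plain closure.

```rust
/// Splits `samples` into speech-only chunks (hypothetical helper).
///
/// - Frames the predictor marks as silent are dropped.
/// - A silent frame closes the chunk being built.
/// - A chunk is also closed before it would exceed `max_chunk_len` samples.
pub fn chunk_samples(
    samples: &[f32],
    frame_len: usize,
    max_chunk_len: usize,
    is_speech: impl Fn(&[f32]) -> bool,
) -> Vec<Vec<f32>> {
    assert!(frame_len > 0 && max_chunk_len >= frame_len);

    let mut chunks = Vec::new();
    let mut current: Vec<f32> = Vec::new();

    for frame in samples.chunks(frame_len) {
        if is_speech(frame) {
            // Force a split so the hard cap is never exceeded.
            if current.len() + frame.len() > max_chunk_len {
                chunks.push(std::mem::take(&mut current));
            }
            current.extend_from_slice(frame);
        } else if !current.is_empty() {
            // Silence ends the current chunk; the silent frame is discarded.
            chunks.push(std::mem::take(&mut current));
        }
    }

    // Flush whatever speech is left at the end of the input.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With 16 kHz audio, a 30 sec cap corresponds to `max_chunk_len = 480_000` samples. This sketch splits on every silent frame, which is too eager for real speech; a practical chunker would likely require a minimum silence duration before splitting.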
silero-rs is a well-tested implementation. (Blog post)
We are not using it because it is hard to enforce the 30 sec max constraint with it. (emotechlab/silero-rs#31)
We have a dataset to test whether the chunker works well:
hyprnote/crates/chunker/src/lib.rs
Line 33 in 573d5ef