Silero based audio chunker #857

Open

@yujonglee

Description

We use an audio chunker to run Whisper inference in a streaming manner.

let chunked =
    audio_source.chunks(hypr_chunker::RMS::new(), std::time::Duration::from_secs(15));
hypr_whisper::local::TranscribeChunkedAudioStreamExt::transcribe(chunked, model)

A good chunker is important. At a high level, it should:

  • Cap chunks at 30 sec (a Whisper constraint). Users want to see results at a faster tempo, so we target around 12 sec. This might need some scoring mechanism, e.g. VAD_prob * buffer_length.
  • Split on silence, and strip silence as much as possible (Filter out silences #662). Whisper tends to hallucinate a lot on empty audio.
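The scoring idea above could be sketched like this (a minimal sketch, not code from the repo; `split_score`, `should_split`, and the weights are all assumptions):

```rust
/// Hypothetical split score: higher means "split here now".
/// Scales the VAD silence probability by how full the buffer is,
/// so long buffers split even on weak silence evidence.
fn split_score(silence_prob: f32, buffer_secs: f32, target_secs: f32) -> f32 {
    silence_prob * (buffer_secs / target_secs)
}

fn should_split(silence_prob: f32, buffer_secs: f32) -> bool {
    const TARGET_SECS: f32 = 12.0; // preferred chunk length
    const MAX_SECS: f32 = 30.0; // hard Whisper limit
    buffer_secs >= MAX_SECS || split_score(silence_prob, buffer_secs, TARGET_SECS) >= 1.0
}

fn main() {
    // Strong silence at the target length splits...
    assert!(should_split(1.0, 12.0));
    // ...but weak silence early on does not.
    assert!(!should_split(0.2, 5.0));
    // The hard cap always splits, even with zero silence probability.
    assert!(should_split(0.0, 30.0));
}
```

The point of multiplying rather than thresholding each term separately is that the tolerance for "how silent is silent enough" loosens as the buffer grows.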

Our current approach:

chunker/stream.rs works with a pluggable predictor.

Currently we use a very simple RMS-based predictor:

impl Predictor for RMS {
    fn predict(&self, samples: &[f32]) -> Result<bool, crate::Error> {
        if samples.is_empty() {
            return Ok(false);
        }
        let sum_squares: f32 = samples.iter().map(|&sample| sample * sample).sum();
        let mean_square = sum_squares / samples.len() as f32;
        let rms = mean_square.sqrt();
        Ok(rms > 0.009)
    }
}
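To see why a fixed RMS threshold is "very simple" and motivates moving to a real VAD, here is a self-contained sketch (threshold copied from the snippet above; the signals are synthetic):

```rust
/// Same RMS computation as the predictor above, as a free function.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    (samples.iter().map(|&s| s * s).sum::<f32>() / samples.len() as f32).sqrt()
}

fn main() {
    let threshold = 0.009; // fixed threshold from the predictor above

    // Quiet speech-like tone (amplitude 0.005): RMS ~= 0.0035,
    // below the threshold, so it is misread as silence.
    let quiet: Vec<f32> = (0..1600).map(|i| 0.005 * (i as f32 * 0.2).sin()).collect();

    // Low-frequency background hum (amplitude 0.02): RMS ~= 0.014,
    // above the threshold, so it is misread as speech.
    let hum: Vec<f32> = (0..1600).map(|i| 0.02 * (i as f32 * 0.01).sin()).collect();

    assert!(rms(&quiet) < threshold);
    assert!(rms(&hum) > threshold);
}
```

Energy alone cannot separate quiet speech from loud non-speech, which is exactly what a learned VAD like Silero is for.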

Max-length constraint:

while this.buffer.len() < max_samples {
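The hard cap in that loop could be factored into a small buffer type along these lines (a sketch under assumed names; `ChunkBuffer` is hypothetical, not from chunker/stream.rs):

```rust
/// Hypothetical buffer that enforces the hard sample cap
/// from the `max_samples` loop above.
struct ChunkBuffer {
    samples: Vec<f32>,
    max_samples: usize,
}

impl ChunkBuffer {
    fn new(max_secs: usize, sample_rate: usize) -> Self {
        Self {
            samples: Vec::new(),
            max_samples: max_secs * sample_rate,
        }
    }

    /// Pushes a frame; returns a full chunk once the cap is reached,
    /// even if the predictor never reported silence.
    fn push(&mut self, frame: &[f32]) -> Option<Vec<f32>> {
        self.samples.extend_from_slice(frame);
        if self.samples.len() >= self.max_samples {
            Some(std::mem::take(&mut self.samples))
        } else {
            None
        }
    }
}

fn main() {
    // 30 sec cap at 16 kHz.
    let mut buf = ChunkBuffer::new(30, 16_000);
    let frame = vec![0.0f32; 160_000]; // 10 sec of audio per frame
    assert!(buf.push(&frame).is_none());
    assert!(buf.push(&frame).is_none());
    let chunk = buf.push(&frame).expect("cap reached at 30 sec");
    assert_eq!(chunk.len(), 480_000);
    assert!(buf.samples.is_empty()); // buffer reset after emitting
}
```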


silero-rs is a well-tested implementation. (Blog post)

We are not using it because it is hard to enforce the 30 sec max constraint. (emotechlab/silero-rs#31)

https://github.com/emotechlab/silero-rs/blob/6e8637b9d06cac41bbfe47e9933289f16ecbf87f/src/lib.rs#L85-L99

https://github.com/emotechlab/silero-rs/blob/6e8637b9d06cac41bbfe47e9933289f16ecbf87f/src/lib.rs#L387-L399
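If we did adopt a Silero-style VAD, one workaround for the missing max-length knob is to wrap any predictor in a cap that forces a split regardless of the VAD's own segmentation. A minimal sketch, assuming a `Predictor` shape like the RMS snippet above (all names here are assumptions):

```rust
/// Assumed predictor shape: `predict` returns true when speech is present.
trait Predictor {
    fn predict(&mut self, samples: &[f32]) -> bool;
}

/// Decides chunk boundaries: split on detected silence, or
/// unconditionally once `max_samples` have accumulated since the
/// last split. This works around a VAD with no max-segment limit.
struct Capped<P> {
    inner: P,
    seen: usize,
    max_samples: usize,
}

impl<P: Predictor> Capped<P> {
    fn should_split(&mut self, samples: &[f32]) -> bool {
        self.seen += samples.len();
        let silence = !self.inner.predict(samples);
        let split = silence || self.seen >= self.max_samples;
        if split {
            self.seen = 0; // start counting the next chunk
        }
        split
    }
}

/// VAD stand-in that always reports speech (worst case for chunking).
struct AlwaysSpeech;
impl Predictor for AlwaysSpeech {
    fn predict(&mut self, _samples: &[f32]) -> bool {
        true
    }
}

fn main() {
    let mut c = Capped { inner: AlwaysSpeech, seen: 0, max_samples: 480_000 }; // 30 sec @ 16 kHz
    let frame = vec![0.0f32; 160_000]; // 10 sec per frame
    assert!(!c.should_split(&frame));
    assert!(!c.should_split(&frame));
    assert!(c.should_split(&frame)); // cap forces a split at 30 sec
    assert!(!c.should_split(&frame)); // counter was reset after the split
}
```

The upside of wrapping rather than patching silero-rs is that the cap stays in our chunker and composes with any predictor, RMS included.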


We have a dataset to test whether the chunker works well:

async fn test_chunker() {
