Silero based audio chunker #857

Open

@yujonglee

Description

We use an audio chunker to run Whisper inference in a streaming manner.

let chunked =
    audio_source.chunks(hypr_chunker::RMS::new(), std::time::Duration::from_secs(15));
hypr_whisper::local::TranscribeChunkedAudioStreamExt::transcribe(chunked, model)

A good chunker is important. At a high level, it should:

  • Cap chunks at 30 sec (a Whisper constraint). Users want to see results at a faster tempo, so we target around 12 sec. This might need some scoring mechanism, e.g. VAD_prob * buffer_length.
  • Split on silence, and strip silence as much as possible (Filter out silences #662). Whisper tends to hallucinate a lot on empty audio.
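The scoring idea above could be sketched like this (a minimal sketch, not code from the repo; `split_score`, `should_split`, and the weights are all assumptions):

```rust
/// Hypothetical split score: higher means "split here now".
/// Scales the VAD silence probability by how full the buffer is,
/// so long buffers split even on weak silence evidence.
fn split_score(silence_prob: f32, buffer_secs: f32, target_secs: f32) -> f32 {
    silence_prob * (buffer_secs / target_secs)
}

fn should_split(silence_prob: f32, buffer_secs: f32) -> bool {
    const TARGET_SECS: f32 = 12.0; // preferred chunk length
    const MAX_SECS: f32 = 30.0; // hard Whisper limit
    buffer_secs >= MAX_SECS || split_score(silence_prob, buffer_secs, TARGET_SECS) >= 1.0
}

fn main() {
    // Strong silence at the target length splits...
    assert!(should_split(1.0, 12.0));
    // ...but weak silence early on does not.
    assert!(!should_split(0.2, 5.0));
    // The hard cap always splits, even with zero silence probability.
    assert!(should_split(0.0, 30.0));
}
```

The point of multiplying rather than thresholding each term separately is that the tolerance for "how silent is silent enough" loosens as the buffer grows.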

Our current approach:

chunker/stream.rs works with a pluggable predictor.

Currently we use a very simple RMS-based predictor:

impl Predictor for RMS {
    fn predict(&self, samples: &[f32]) -> Result<bool, crate::Error> {
        if samples.is_empty() {
            return Ok(false);
        }
        let sum_squares: f32 = samples.iter().map(|&sample| sample * sample).sum();
        let mean_square = sum_squares / samples.len() as f32;
        let rms = mean_square.sqrt();
        Ok(rms > 0.009)
    }
}
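To see why a fixed RMS threshold is "very simple" and motivates moving to a real VAD, here is a self-contained sketch (threshold copied from the snippet above; the signals are synthetic):

```rust
/// Same RMS computation as the predictor above, as a free function.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    (samples.iter().map(|&s| s * s).sum::<f32>() / samples.len() as f32).sqrt()
}

fn main() {
    let threshold = 0.009; // fixed threshold from the predictor above

    // Quiet speech-like tone (amplitude 0.005): RMS ~= 0.0035,
    // below the threshold, so it is misread as silence.
    let quiet: Vec<f32> = (0..1600).map(|i| 0.005 * (i as f32 * 0.2).sin()).collect();

    // Low-frequency background hum (amplitude 0.02): RMS ~= 0.014,
    // above the threshold, so it is misread as speech.
    let hum: Vec<f32> = (0..1600).map(|i| 0.02 * (i as f32 * 0.01).sin()).collect();

    assert!(rms(&quiet) < threshold);
    assert!(rms(&hum) > threshold);
}
```

Energy alone cannot separate quiet speech from loud non-speech, which is exactly what a learned VAD like Silero is for.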

Max-length constraint:

while this.buffer.len() < max_samples {
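The hard cap in that loop could be factored into a small buffer type along these lines (a sketch under assumed names; `ChunkBuffer` is hypothetical, not from chunker/stream.rs):

```rust
/// Hypothetical buffer that enforces the hard sample cap
/// from the `max_samples` loop above.
struct ChunkBuffer {
    samples: Vec<f32>,
    max_samples: usize,
}

impl ChunkBuffer {
    fn new(max_secs: usize, sample_rate: usize) -> Self {
        Self {
            samples: Vec::new(),
            max_samples: max_secs * sample_rate,
        }
    }

    /// Pushes a frame; returns a full chunk once the cap is reached,
    /// even if the predictor never reported silence.
    fn push(&mut self, frame: &[f32]) -> Option<Vec<f32>> {
        self.samples.extend_from_slice(frame);
        if self.samples.len() >= self.max_samples {
            Some(std::mem::take(&mut self.samples))
        } else {
            None
        }
    }
}

fn main() {
    // 30 sec cap at 16 kHz.
    let mut buf = ChunkBuffer::new(30, 16_000);
    let frame = vec![0.0f32; 160_000]; // 10 sec of audio per frame
    assert!(buf.push(&frame).is_none());
    assert!(buf.push(&frame).is_none());
    let chunk = buf.push(&frame).expect("cap reached at 30 sec");
    assert_eq!(chunk.len(), 480_000);
    assert!(buf.samples.is_empty()); // buffer reset after emitting
}
```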


silero-rs is a well-tested implementation. (Blog post)

We are not using it because it is hard to enforce the 30 sec max constraint. (emotechlab/silero-rs#31)

https://github.com/emotechlab/silero-rs/blob/6e8637b9d06cac41bbfe47e9933289f16ecbf87f/src/lib.rs#L85-L99

https://github.com/emotechlab/silero-rs/blob/6e8637b9d06cac41bbfe47e9933289f16ecbf87f/src/lib.rs#L387-L399
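If we did adopt a Silero-style VAD, one workaround for the missing max-length knob is to wrap any predictor in a cap that forces a split regardless of the VAD's own segmentation. A minimal sketch, assuming a `Predictor` shape like the RMS snippet above (all names here are assumptions):

```rust
/// Assumed predictor shape: `predict` returns true when speech is present.
trait Predictor {
    fn predict(&mut self, samples: &[f32]) -> bool;
}

/// Decides chunk boundaries: split on detected silence, or
/// unconditionally once `max_samples` have accumulated since the
/// last split. This works around a VAD with no max-segment limit.
struct Capped<P> {
    inner: P,
    seen: usize,
    max_samples: usize,
}

impl<P: Predictor> Capped<P> {
    fn should_split(&mut self, samples: &[f32]) -> bool {
        self.seen += samples.len();
        let silence = !self.inner.predict(samples);
        let split = silence || self.seen >= self.max_samples;
        if split {
            self.seen = 0; // start counting the next chunk
        }
        split
    }
}

/// VAD stand-in that always reports speech (worst case for chunking).
struct AlwaysSpeech;
impl Predictor for AlwaysSpeech {
    fn predict(&mut self, _samples: &[f32]) -> bool {
        true
    }
}

fn main() {
    let mut c = Capped { inner: AlwaysSpeech, seen: 0, max_samples: 480_000 }; // 30 sec @ 16 kHz
    let frame = vec![0.0f32; 160_000]; // 10 sec per frame
    assert!(!c.should_split(&frame));
    assert!(!c.should_split(&frame));
    assert!(c.should_split(&frame)); // cap forces a split at 30 sec
    assert!(!c.should_split(&frame)); // counter was reset after the split
}
```

The upside of wrapping rather than patching silero-rs is that the cap stays in our chunker and composes with any predictor, RMS included.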


We have a dataset to test whether the chunker works well:

async fn test_chunker() {
