The current human speech dataset was collected and prepared as follows:
- Mozilla Common Voice: English, Hindi, Tamil
- OpenSLR: Telugu, Malayalam
- All downloaded clips were preprocessed and converted into fixed 3-second chunks for training and inference.
If you delete all local datasets, recreate them with the steps below.
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r backend\requirements.txt
New-Item -ItemType Directory -Force -Path processed\human\english,processed\human\hindi,processed\human\tamil,processed\human\telugu,processed\human\malayalam | Out-Null
New-Item -ItemType Directory -Force -Path processed\ai,processed\ai_augmented,processed\ai_artifacts,processed\human_degraded | Out-Null
```

Download the source datasets:

- Mozilla Common Voice (English, Hindi, Tamil): https://commonvoice.mozilla.org/
- OpenSLR (Telugu, Malayalam): https://openslr.org/

Extract the downloaded archives and collect the audio files (.wav/.mp3) per language.
Use your preprocessing pipeline to:
- convert to mono
- resample to 16 kHz
- normalize amplitude
- split into fixed 3-second segments
Note: training/inference loaders in this repo use 4-second internal chunks; 3-second files are still valid and get padded automatically when needed.
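Assuming numpy and scipy are available, the preprocessing steps above can be sketched as follows (the function name and constants are illustrative, not the repo's actual pipeline):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000       # target sample rate
CHUNK_SECONDS = 3        # fixed segment length

def preprocess(audio: np.ndarray, sr: int) -> list[np.ndarray]:
    """Mono-mix, resample to 16 kHz, peak-normalize, split into 3 s chunks."""
    if audio.ndim == 2:                          # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:                          # rational-factor resampling
        g = gcd(sr, TARGET_SR)
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak                     # amplitude normalization
    n = TARGET_SR * CHUNK_SECONDS
    chunks = [audio[i:i + n] for i in range(0, len(audio), n)]
    if chunks and len(chunks[-1]) < n:           # zero-pad the last partial chunk
        chunks[-1] = np.pad(chunks[-1], (0, n - len(chunks[-1])))
    return chunks
```

Each returned chunk is exactly 48,000 samples (3 s at 16 kHz), matching the padded-chunk note above.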
Save final chunks under:
```
processed/human/english
processed/human/hindi
processed/human/tamil
processed/human/telugu
processed/human/malayalam
```
Put clean AI WAV files in processed/ai/<language>/, then run:
```
python data_pipeline/build_augmented_ai.py
python data_pipeline/build_ai_artifacts.py
```

If you need degraded human data, run data_pipeline/build_human_degraded.py and place its output in processed/human_degraded/<language>/.

```
python data_pipeline/build_stage1_dataset.py
```

This generates:

```
stage_1/stage1_data/human
stage_1/stage1_data/ai
```

```
python data_pipeline/build_dataset_audio_only.py
```

This regenerates X.npy and y.npy (used by the baseline model).
- build_dataset_audio_only.py: builds/organizes the base audio-only dataset structure.
- build_stage1_dataset.py: prepares Stage-1 dataset folders (ai/human).
- augment_ai.py: applies synthetic degradations/perturbations to AI audio.
- build_augmented_ai.py: generates and saves augmented AI samples at scale.
- build_human_degraded.py: creates degraded/noisy human variants for robustness.
- dataset.py: common/general dataset loader utilities.
- dataset_stage1.py: Stage-1 dataset loader utilities.
- dataset_stage2.py: Stage-2 dataset loader utilities.
- build_stage1_dataset.py: Stage-1 local dataset preparation.
- dataset_stage1.py: wav loading, resampling, chunk sampling for Stage-1.
- stage1_feature_extraction.py: feature extraction pipeline for feature-based Stage-1 experiments.
- stage1_model.py: Stage-1 model definition used by inference/training code.
- train_stage1_final.py: final Stage-1 wav2vec2 training script (epoch checkpoints like stage1_detector_epochN.pt).
- eval_stage1.py: evaluation script for Stage-1 model performance checks.
- train_stage1_final.py: training entrypoint for Stage-1 detector.
- train_stage2_aasist.py: training entrypoint for Stage-2 AASIST verifier.
- stage1_model.py: training-side Stage-1 model module.
The first version of the system used a LightGBM (LGBM) classifier trained on low-level handcrafted acoustic features.
- Accuracy: ~85%
- Weakness: failed on high-quality studio recordings
- Training bias: mostly low-quality, noisy recordings
- Could not generalize to:
  - Clean AI
  - Studio human voices
  - Augmented / compressed audio
This version served as a strong baseline but exposed the need for:
- Better representation learning
- Robustness to degradation
- AI artifact detection
To fix baseline weaknesses, we introduced signal-level feature engineering.
We defined degraded audio using:
[ SNR_{dB} = 20 \log_{10} \left( \frac{RMS}{NoiseFloor} \right) ]
Where:
- ( RMS = \sqrt{\frac{1}{N} \sum x^2} )
- NoiseFloor = 5th percentile amplitude
Degraded if:
- SNR < 18 dB
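A minimal numpy sketch of this degradation rule (function names are illustrative, not the repo's API):

```python
import numpy as np

def snr_db(x: np.ndarray) -> float:
    """SNR estimate: overall RMS over a noise floor taken as the 5th-percentile amplitude."""
    rms = np.sqrt(np.mean(x ** 2))
    noise_floor = np.percentile(np.abs(x), 5)
    return float(20 * np.log10(rms / max(noise_floor, 1e-8)))

def is_degraded(x: np.ndarray, threshold_db: float = 18.0) -> bool:
    """Flag a clip as degraded when its estimated SNR drops below 18 dB."""
    return snr_db(x) < threshold_db
```

Note the asymmetry this captures: a clip whose quietest samples are still loud (a high noise floor) scores a low SNR and is flagged as degraded.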
[ Flatness = \frac{\exp(\text{mean}(\log(P)))}{\text{mean}(P)} ]
Where:
- (P) = power spectrum
- Higher flatness → more noise-like
- Lower flatness → tonal / studio
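This is the standard geometric-mean-over-arithmetic-mean flatness measure; a small numpy sketch (the epsilon guard is an implementation detail added here):

```python
import numpy as np

def spectral_flatness(x: np.ndarray, eps: float = 1e-10) -> float:
    """exp(mean(log P)) / mean(P) over the power spectrum: ~1 for noise, ~0 for tones."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return float(np.exp(np.mean(np.log(power + eps))) / (np.mean(power) + eps))
```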
[ Clipped = \frac{\#(|x| > 0.99)}{N} ]
Used to detect distortion.
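The clipping ratio is a one-liner in numpy (name illustrative):

```python
import numpy as np

def clipping_ratio(x: np.ndarray, limit: float = 0.99) -> float:
    """Fraction of samples whose magnitude exceeds the clipping limit."""
    return float(np.mean(np.abs(x) > limit))
```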
We engineered AI-specific indicators:
- Repetition Score
  - MFCC cosine similarity across chunks
  - High similarity → looped patterns
- Pitch Variance
  - Estimated F0 via autocorrelation
  - Low variance → synthetic
- Vocoder Artifact Score
  [ Score = Flatness + 0.02(1 - HF_{ratio}) ]
- Dynamics Ratio
  [ dyn = \frac{\sigma(RMS)}{\mu(RMS)} ]
  - Low dynamics → monotone AI
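Two of these indicators can be sketched in numpy. The repetition score here takes pre-computed per-chunk feature vectors (e.g. MFCC means) as input rather than computing MFCCs itself; names and frame sizes are illustrative:

```python
import numpy as np

def dynamics_ratio(x: np.ndarray, frame: int = 1024) -> float:
    """std/mean of frame-level RMS; low values suggest monotone, over-even delivery."""
    n = len(x) // frame
    rms = np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1))
    return float(np.std(rms) / (np.mean(rms) + 1e-8))

def repetition_score(chunk_feats: np.ndarray) -> float:
    """Mean pairwise cosine similarity across per-chunk feature vectors."""
    f = chunk_feats / (np.linalg.norm(chunk_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                              # cosine similarity matrix
    iu = np.triu_indices(len(f), k=1)          # upper triangle: each pair once
    return float(sim[iu].mean())
```

Looped or heavily templated synthesis pushes the repetition score toward 1.0, while natural speech chunks diverge.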
This improved robustness, but classical ML still lacked deep representation power.
We upgraded to a representation learning model:
- facebook/wav2vec2-base encoder
- Mean-pooled embeddings
- MLP head: 768 → 256 → 1
- Binary classifier: outputs probability of AI
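As a shape-level sketch of the head's forward pass (numpy with random weights purely for illustration; the repo's actual head is a trained PyTorch module in stage1_model.py):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

class Stage1Head:
    """Mean-pool wav2vec2 frame embeddings (T, 768), then a 768 -> 256 -> 1 MLP."""

    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((768, 256)) * 0.02
        self.b1 = np.zeros(256)
        self.w2 = rng.standard_normal((256, 1)) * 0.02
        self.b2 = np.zeros(1)

    def __call__(self, frame_embeddings: np.ndarray) -> float:
        pooled = frame_embeddings.mean(axis=0)           # (T, 768) -> (768,)
        h = np.maximum(pooled @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return float(sigmoid(h @ self.w2 + self.b2))     # probability of AI
```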
```
S1_HUMAN_RECHECK_THRESHOLD = 0.40
S1_AI_CHECK_THRESHOLD = 0.75
S1_VERY_CONFIDENT = 0.90
S1_CONFIDENT = 0.82
```

| Range | Meaning |
|---|---|
| < 0.40 | HUMAN (trusted immediately) |
| 0.40–0.75 | Ambiguous |
| > 0.75 | AI candidate |
| > 0.90 | Very confident AI |
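The table maps directly onto a small lookup function (the band names returned here are illustrative labels, not the repo's):

```python
S1_HUMAN_RECHECK_THRESHOLD = 0.40
S1_AI_CHECK_THRESHOLD = 0.75
S1_VERY_CONFIDENT = 0.90

def stage1_band(p: float) -> str:
    """Map a Stage-1 AI probability onto its interpretation band."""
    if p < S1_HUMAN_RECHECK_THRESHOLD:
        return "HUMAN"                 # trusted immediately
    if p < S1_AI_CHECK_THRESHOLD:
        return "AMBIGUOUS"
    if p < S1_VERY_CONFIDENT:
        return "AI_CANDIDATE"
    return "VERY_CONFIDENT_AI"
```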
Stage 1 became strong — but occasionally overconfident.
So we built Stage 2.
AASIST is a graph attention-based backend that analyzes:
- Frame-level temporal patterns
- Spectral artifacts
- Subtle vocoder cues
- Cross-frame dependencies
Architecture:
Wav2Vec2 Encoder → AASIST Backend → Classifier
Instead of hard thresholds:
Different confidence levels require different strategies.
For S1 < 0.40 (HUMAN band):

✔ Trust immediately
✔ No AASIST verification
✔ Minimizes false AI labeling
Tiered adaptive logic:
Very confident S1 (> 0.90):

| AASIST | Decision |
|---|---|
| ≥ 0.40 | AI (70% S1 + 30% AASIST) |
| 0.20–0.40 | AI (trust S1) |
| < 0.20 | HUMAN |
Confident S1 (0.82–0.90):

| AASIST | Decision |
|---|---|
| ≥ 0.45 | AI (65/35 weighted) |
| 0.25–0.45 | Weighted check |
| < 0.25 | Feature-based tie break |
AI candidate S1 (0.75–0.82):

| AASIST | Decision |
|---|---|
| ≥ 0.50 | AI (50/50) |
| 0.30–0.50 | INCONCLUSIVE |
| < 0.30 | HUMAN |
For ambiguous S1 (0.40–0.75), trust AASIST more:
[ FinalScore = 0.4 \cdot S1 + 0.6 \cdot AASIST ]
| AASIST | Decision |
|---|---|
| ≥ 0.55 | AI |
| < 0.35 | HUMAN |
| 0.35–0.55 | Weighted decision |
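For this band, the table and the FinalScore formula combine into a short decision function. The 0.5 cutoff applied to the blended score in the middle band is an assumption for illustration; the source only specifies the outer thresholds and the weights:

```python
def fuse_ambiguous(s1: float, aasist: float) -> str:
    """Fusion for the ambiguous band: weight AASIST more heavily, then threshold."""
    if aasist >= 0.55:
        return "AI"
    if aasist < 0.35:
        return "HUMAN"
    final_score = 0.4 * s1 + 0.6 * aasist   # FinalScore = 0.4*S1 + 0.6*AASIST
    return "AI" if final_score >= 0.5 else "HUMAN"  # 0.5 cutoff assumed
```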
Human voices rarely score above S1 = 0.40, so the immediate-trust path is safe.
Very confident S1 AI (>0.90) no longer blocked by mild AASIST uncertainty.
Hand-crafted features compensate for over-clean recordings.
Vocoder + repetition detection catches artifacts.
Example:
S1 = 0.85
AASIST = 0.38
Weighted:
[ 0.6(0.85) + 0.4(0.38) = 0.662 ]
→ AI
Old system → INCONCLUSIVE
New system → Correct AI
```
Audio Input
   ↓
Chunking (4 s, 50% overlap)
   ↓
Stage 1 (Wav2Vec2 + MLP)
   ↓
Hand-crafted feature extraction
   ↓
AASIST (conditional verification)
   ↓
Confidence-weighted decision fusion
   ↓
Final Label + Explanation
```
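The chunking step at the top of the pipeline (4 s windows, 50% overlap) can be sketched as (names and padding behavior illustrative):

```python
import numpy as np

SR = 16_000
CHUNK_S = 4        # window length in seconds
HOP_S = 2          # 2 s hop = 50% overlap

def overlapping_chunks(audio: np.ndarray) -> list[np.ndarray]:
    """Slice audio into 4 s windows with a 2 s hop; zero-pad the final short window."""
    win, hop = SR * CHUNK_S, SR * HOP_S
    chunks = []
    for start in range(0, max(len(audio) - hop, 1), hop):
        c = audio[start:start + win]
        if len(c) < win:
            c = np.pad(c, (0, win - len(c)))
        chunks.append(c)
    return chunks
```

A 3-second file yields a single zero-padded chunk, matching the loader note earlier in this document.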
The system produces structured layman explanations based on:
- Breathing detection
- Pitch variance
- Dynamics ratio
- Repetition score
- Vocoder artifacts
- Clipping
- Reverb
- Micro-noises
This makes the system explainable — critical for hackathon judging.
| Feature | Supported |
|---|---|
| Studio Human | ✅ |
| Low Quality Human | ✅ |
| Augmented AI | ✅ |
| Clean AI | ✅ |
| Edge Cases | INCONCLUSIVE |
| Confidence Output | Yes |
| Explainability | Yes |
The result is a multi-stage, adaptive AI-voice verification system built to handle real-world edge cases. The system is **language independent**.