The current human speech dataset was collected and prepared as follows:
- Mozilla Common Voice: English, Hindi, Tamil
- OpenSLR: Telugu, Malayalam
- All downloaded clips were preprocessed and converted into fixed 3-second chunks for training and inference.
If you delete all local datasets, recreate them with the steps below.
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r backend\requirements.txt
New-Item -ItemType Directory -Force -Path processed\human\english,processed\human\hindi,processed\human\tamil,processed\human\telugu,processed\human\malayalam | Out-Null
New-Item -ItemType Directory -Force -Path processed\ai,processed\ai_augmented,processed\ai_artifacts,processed\human_degraded | Out-Null
```

Download the source datasets:

- Mozilla Common Voice (English, Hindi, Tamil): https://commonvoice.mozilla.org/
- OpenSLR (Telugu, Malayalam): https://openslr.org/

Extract the downloaded archives and collect the audio files (.wav/.mp3) per language.
Use your preprocessing pipeline to:
- convert to mono
- resample to 16 kHz
- normalize amplitude
- split into fixed 3-second segments
Note: training/inference loaders in this repo use 4-second internal chunks; 3-second files are still valid and get padded automatically when needed.
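Assuming numpy and scipy are available, the preprocessing steps above can be sketched as follows (the function name and constants are illustrative, not the repo's actual pipeline):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000       # target sample rate
CHUNK_SECONDS = 3        # fixed segment length

def preprocess(audio: np.ndarray, sr: int) -> list[np.ndarray]:
    """Mono-mix, resample to 16 kHz, peak-normalize, split into 3 s chunks."""
    if audio.ndim == 2:                          # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:                          # rational-factor resampling
        g = gcd(sr, TARGET_SR)
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak                     # amplitude normalization
    n = TARGET_SR * CHUNK_SECONDS
    chunks = [audio[i:i + n] for i in range(0, len(audio), n)]
    if chunks and len(chunks[-1]) < n:           # zero-pad the last partial chunk
        chunks[-1] = np.pad(chunks[-1], (0, n - len(chunks[-1])))
    return chunks
```

Each returned chunk is exactly 48,000 samples (3 s at 16 kHz), matching the padded-chunk note above.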
Save final chunks under:
```
processed/human/english
processed/human/hindi
processed/human/tamil
processed/human/telugu
processed/human/malayalam
```
Put clean AI WAV files in processed/ai/<language>/, then run:
```
python data_pipeline/build_augmented_ai.py
python data_pipeline/build_ai_artifacts.py
```

If you need degraded human data, run data_pipeline/build_human_degraded.py and place its output in processed/human_degraded/<language>/.

```
python data_pipeline/build_stage1_dataset.py
```

This generates:

```
stage_1/stage1_data/human
stage_1/stage1_data/ai
```

```
python data_pipeline/build_dataset_audio_only.py
```

This regenerates X.npy and y.npy (used by the baseline model).
- build_dataset_audio_only.py: builds/organizes the base audio-only dataset structure.
- build_stage1_dataset.py: prepares Stage-1 dataset folders (ai/human).
- augment_ai.py: applies synthetic degradations/perturbations to AI audio.
- build_augmented_ai.py: generates and saves augmented AI samples at scale.
- build_human_degraded.py: creates degraded/noisy human variants for robustness.
- dataset.py: common/general dataset loader utilities.
- dataset_stage1.py: Stage-1 dataset loader utilities.
- dataset_stage2.py: Stage-2 dataset loader utilities.
- build_stage1_dataset.py: Stage-1 local dataset preparation.
- dataset_stage1.py: wav loading, resampling, chunk sampling for Stage-1.
- stage1_feature_extraction.py: feature extraction pipeline for feature-based Stage-1 experiments.
- stage1_model.py: Stage-1 model definition used by inference/training code.
- train_stage1_final.py: final Stage-1 wav2vec2 training script (epoch checkpoints like stage1_detector_epochN.pt).
- eval_stage1.py: evaluation script for Stage-1 model performance checks.
- train_stage1_final.py: training entrypoint for Stage-1 detector.
- train_stage2_aasist.py: training entrypoint for Stage-2 AASIST verifier.
- stage1_model.py: training-side Stage-1 model module.
The first version of the system used a LightGBM (LGBM) classifier trained on low-level handcrafted acoustic features.
- Accuracy: ~85%
- Weakness: failed on high-quality studio recordings
- Training bias: mostly low-quality, noisy recordings
- Could not generalize to:
  - Clean AI
  - Studio human voices
  - Augmented / compressed audio
This version served as a strong baseline but exposed the need for:
- Better representation learning
- Robustness to degradation
- AI artifact detection
To fix baseline weaknesses, we introduced signal-level feature engineering.
We defined degraded audio using:
[ SNR_{dB} = 20 \log_{10} \left( \frac{RMS}{NoiseFloor} \right) ]
Where:
- ( RMS = \sqrt{\frac{1}{N} \sum x^2} )
- NoiseFloor = 5th percentile amplitude
Degraded if:
- SNR < 18 dB
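A minimal numpy sketch of this degradation rule (function names are illustrative, not the repo's API):

```python
import numpy as np

def snr_db(x: np.ndarray) -> float:
    """SNR estimate: overall RMS over a noise floor taken as the 5th-percentile amplitude."""
    rms = np.sqrt(np.mean(x ** 2))
    noise_floor = np.percentile(np.abs(x), 5)
    return float(20 * np.log10(rms / max(noise_floor, 1e-8)))

def is_degraded(x: np.ndarray, threshold_db: float = 18.0) -> bool:
    """Flag a clip as degraded when its estimated SNR drops below 18 dB."""
    return snr_db(x) < threshold_db
```

Note the asymmetry this captures: a clip whose quietest samples are still loud (a high noise floor) scores a low SNR and is flagged as degraded.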
[ Flatness = \frac{\exp(\text{mean}(\log(P)))}{\text{mean}(P)} ]
Where:
- (P) = power spectrum
- Higher flatness → more noise-like
- Lower flatness → tonal / studio
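This is the standard geometric-mean-over-arithmetic-mean flatness measure; a small numpy sketch (the epsilon guard is an implementation detail added here):

```python
import numpy as np

def spectral_flatness(x: np.ndarray, eps: float = 1e-10) -> float:
    """exp(mean(log P)) / mean(P) over the power spectrum: ~1 for noise, ~0 for tones."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return float(np.exp(np.mean(np.log(power + eps))) / (np.mean(power) + eps))
```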
[ Clipped = \frac{\#(|x| > 0.99)}{N} ]
Used to detect distortion.
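The clipping ratio is a one-liner in numpy (name illustrative):

```python
import numpy as np

def clipping_ratio(x: np.ndarray, limit: float = 0.99) -> float:
    """Fraction of samples whose magnitude exceeds the clipping limit."""
    return float(np.mean(np.abs(x) > limit))
```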
We engineered AI-specific indicators:
- Repetition Score
  - MFCC cosine similarity across chunks
  - High similarity → looped patterns
- Pitch Variance
  - Estimated F0 via autocorrelation
  - Low variance → synthetic
- Vocoder Artifact Score
  [ Score = Flatness + 0.02(1 - HF_{ratio}) ]
- Dynamics Ratio
  [ dyn = \frac{\sigma(RMS)}{\mu(RMS)} ]
  - Low dynamics → monotone AI
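Two of these indicators can be sketched in numpy. The repetition score here takes pre-computed per-chunk feature vectors (e.g. MFCC means) as input rather than computing MFCCs itself; names and frame sizes are illustrative:

```python
import numpy as np

def dynamics_ratio(x: np.ndarray, frame: int = 1024) -> float:
    """std/mean of frame-level RMS; low values suggest monotone, over-even delivery."""
    n = len(x) // frame
    rms = np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1))
    return float(np.std(rms) / (np.mean(rms) + 1e-8))

def repetition_score(chunk_feats: np.ndarray) -> float:
    """Mean pairwise cosine similarity across per-chunk feature vectors."""
    f = chunk_feats / (np.linalg.norm(chunk_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                              # cosine similarity matrix
    iu = np.triu_indices(len(f), k=1)          # upper triangle: each pair once
    return float(sim[iu].mean())
```

Looped or heavily templated synthesis pushes the repetition score toward 1.0, while natural speech chunks diverge.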
This improved robustness, but classical ML still lacked deep representation power.
We upgraded to a representation learning model:
- facebook/wav2vec2-base encoder
- Mean-pooled embeddings
- MLP head: 768 → 256 → 1
- Binary classifier: outputs probability of AI
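As a shape-level sketch of the head's forward pass (numpy with random weights purely for illustration; the repo's actual head is a trained PyTorch module in stage1_model.py):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

class Stage1Head:
    """Mean-pool wav2vec2 frame embeddings (T, 768), then a 768 -> 256 -> 1 MLP."""

    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((768, 256)) * 0.02
        self.b1 = np.zeros(256)
        self.w2 = rng.standard_normal((256, 1)) * 0.02
        self.b2 = np.zeros(1)

    def __call__(self, frame_embeddings: np.ndarray) -> float:
        pooled = frame_embeddings.mean(axis=0)           # (T, 768) -> (768,)
        h = np.maximum(pooled @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return float(sigmoid(h @ self.w2 + self.b2))     # probability of AI
```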
```
S1_HUMAN_RECHECK_THRESHOLD = 0.40
S1_AI_CHECK_THRESHOLD = 0.75
S1_VERY_CONFIDENT = 0.90
S1_CONFIDENT = 0.82
```

| Range | Meaning |
|---|---|
| < 0.40 | HUMAN (trusted immediately) |
| 0.40–0.75 | Ambiguous |
| > 0.75 | AI candidate |
| > 0.90 | Very confident AI |
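The table maps directly onto a small lookup function (the band names returned here are illustrative labels, not the repo's):

```python
S1_HUMAN_RECHECK_THRESHOLD = 0.40
S1_AI_CHECK_THRESHOLD = 0.75
S1_VERY_CONFIDENT = 0.90

def stage1_band(p: float) -> str:
    """Map a Stage-1 AI probability onto its interpretation band."""
    if p < S1_HUMAN_RECHECK_THRESHOLD:
        return "HUMAN"                 # trusted immediately
    if p < S1_AI_CHECK_THRESHOLD:
        return "AMBIGUOUS"
    if p < S1_VERY_CONFIDENT:
        return "AI_CANDIDATE"
    return "VERY_CONFIDENT_AI"
```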
Stage 1 became strong — but occasionally overconfident.
So we built Stage 2.
AASIST is a graph attention-based backend that analyzes:
- Frame-level temporal patterns
- Spectral artifacts
- Subtle vocoder cues
- Cross-frame dependencies
Architecture:
Wav2Vec2 Encoder → AASIST Backend → Classifier
Instead of hard thresholds:
Different confidence levels require different strategies.
For S1 < 0.40 (HUMAN band):

✔ Trust immediately
✔ No AASIST verification
✔ Minimizes false AI labeling
Tiered adaptive logic:
Very confident S1 (> 0.90):

| AASIST | Decision |
|---|---|
| ≥ 0.40 | AI (70% S1 + 30% AASIST) |
| 0.20–0.40 | AI (trust S1) |
| < 0.20 | HUMAN |
Confident S1 (0.82–0.90):

| AASIST | Decision |
|---|---|
| ≥ 0.45 | AI (65/35 weighted) |
| 0.25–0.45 | Weighted check |
| < 0.25 | Feature-based tie break |
AI candidate S1 (0.75–0.82):

| AASIST | Decision |
|---|---|
| ≥ 0.50 | AI (50/50) |
| 0.30–0.50 | INCONCLUSIVE |
| < 0.30 | HUMAN |
For ambiguous S1 (0.40–0.75), trust AASIST more:
[ FinalScore = 0.4 \cdot S1 + 0.6 \cdot AASIST ]
| AASIST | Decision |
|---|---|
| ≥ 0.55 | AI |
| < 0.35 | HUMAN |
| 0.35–0.55 | Weighted decision |
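For this band, the table and the FinalScore formula combine into a short decision function. The 0.5 cutoff applied to the blended score in the middle band is an assumption for illustration; the source only specifies the outer thresholds and the weights:

```python
def fuse_ambiguous(s1: float, aasist: float) -> str:
    """Fusion for the ambiguous band: weight AASIST more heavily, then threshold."""
    if aasist >= 0.55:
        return "AI"
    if aasist < 0.35:
        return "HUMAN"
    final_score = 0.4 * s1 + 0.6 * aasist   # FinalScore = 0.4*S1 + 0.6*AASIST
    return "AI" if final_score >= 0.5 else "HUMAN"  # 0.5 cutoff assumed
```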
Human voices rarely score above S1 = 0.40, so the immediate-trust path is safe.
Very confident S1 AI (>0.90) no longer blocked by mild AASIST uncertainty.
Hand-crafted features compensate for over-clean recordings.
Vocoder + repetition detection catches artifacts.
Example:
S1 = 0.85
AASIST = 0.38
Weighted:
[ 0.6(0.85) + 0.4(0.38) = 0.662 ]
→ AI
Old system → INCONCLUSIVE
New system → Correct AI
```
Audio Input
   ↓
Chunking (4 s, 50% overlap)
   ↓
Stage 1 (Wav2Vec2 + MLP)
   ↓
Hand-crafted feature extraction
   ↓
AASIST (conditional verification)
   ↓
Confidence-weighted decision fusion
   ↓
Final Label + Explanation
```
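The chunking step at the top of the pipeline (4 s windows, 50% overlap) can be sketched as (names and padding behavior illustrative):

```python
import numpy as np

SR = 16_000
CHUNK_S = 4        # window length in seconds
HOP_S = 2          # 2 s hop = 50% overlap

def overlapping_chunks(audio: np.ndarray) -> list[np.ndarray]:
    """Slice audio into 4 s windows with a 2 s hop; zero-pad the final short window."""
    win, hop = SR * CHUNK_S, SR * HOP_S
    chunks = []
    for start in range(0, max(len(audio) - hop, 1), hop):
        c = audio[start:start + win]
        if len(c) < win:
            c = np.pad(c, (0, win - len(c)))
        chunks.append(c)
    return chunks
```

A 3-second file yields a single zero-padded chunk, matching the loader note earlier in this document.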
The system produces structured layman explanations based on:
- Breathing detection
- Pitch variance
- Dynamics ratio
- Repetition score
- Vocoder artifacts
- Clipping
- Reverb
- Micro-noises
This makes the system explainable — critical for hackathon judging.
| Feature | Supported |
|---|---|
| Studio Human | ✅ |
| Low Quality Human | ✅ |
| Augmented AI | ✅ |
| Clean AI | ✅ |
| Edge Cases | INCONCLUSIVE |
| Confidence Output | Yes |
| Explainability | Yes |
The result is a multi-stage, adaptive AI-voice verification system built to handle real-world edge cases. The system is **language independent**.