This repository contains the code and models for the paper:
"Improving French Synthetic Speech Quality via SSML Prosody Control"
We present a novel, end-to-end pipeline for enhancing the prosody of French synthetic speech using SSML (Speech Synthesis Markup Language) tags. Our approach leverages both supervised and large language model (LLM) methods to automatically annotate text with prosodic cues (pitch, volume, rate, and pauses), significantly improving the naturalness and expressiveness of TTS output.
- 🌟 Overview
- ⚡ Installation
- 🏗️ Project Structure
- 🎮 Usage
- 🤖 Models
- 📊 Demo
- 📚 Citation
- 📄 License
- 📬 Contact
Despite advances in TTS, synthetic French voices often lack natural prosody, especially in expressive contexts. This project provides:
- 🎵 SSML Annotation Pipeline (audioPipeline.py) for French speech
- 📊 Baseline Models (BERT, BiLSTM) for prosody and break prediction
- 🧠 LLM-based Models (zero-shot, few-shot, and cascaded Qwen)
- 📁 Example data and configuration for reproducible experiments
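To make the four prosodic cues concrete, here is a minimal sketch of what an annotated utterance could look like. It is not the code from audioPipeline.py: the helper names (`prosody_span`, `pause`, `build_ssml`) and the numeric values are assumptions chosen for illustration. Only the `<prosody>` attributes (`pitch`, `rate`, `volume`) and the `<break time="...">` element come from the SSML standard.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_span(text, pitch="+0%", rate="+0%", volume="+0%"):
    """Wrap text in an SSML <prosody> element with relative adjustments."""
    attrs = (
        f"pitch={quoteattr(pitch)} rate={quoteattr(rate)} "
        f"volume={quoteattr(volume)}"
    )
    return f"<prosody {attrs}>{escape(text)}</prosody>"


def pause(milliseconds):
    """Return an SSML <break> element of the given duration."""
    return f'<break time="{int(milliseconds)}ms"/>'


def build_ssml(segments, lang="fr-FR"):
    """Join (text, prosody kwargs, pause in ms) segments into one <speak> document."""
    parts = []
    for text, prosody, pause_ms in segments:
        parts.append(prosody_span(text, **prosody))
        if pause_ms:
            parts.append(pause(pause_ms))
    body = "".join(parts)
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{body}</speak>'
    )


# Illustrative values only; they are not taken from the paper.
segments = [
    ("Bonjour à tous,", {"pitch": "+5%", "rate": "-10%"}, 300),
    ("bienvenue !", {"volume": "+10%"}, 0),
]
ssml = build_ssml(segments)
print(ssml)
```

A real Azure TTS request would also need a `<voice>` element inside `<speak>`. It is omitted here to keep the sketch focused on the prosody tags.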
We recommend using Ubuntu 22.04.3 or similar for best compatibility.
- Clone the repository:
    git clone https://github.com/hi-paris/Prosody-Control-French-TTS
- Create and activate the conda environment:
    conda env create -f tts-env.yml
    conda activate tts-env
- Download required tools:
  - Download the .rar archive from Google Drive
  - Place it in a folder named Tools at the root of the repository (prosodyControl/Tools/)
- Add your Azure TTS API key:
  - Create a file named Azure_API_key.txt at the root of the repository
  - Paste your Azure API key into this file
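For reference, a key stored this way could be read with a few lines of Python. This is a hedged sketch, not the loader the pipeline actually uses. The function name `load_azure_key` and the environment variable `AZURE_TTS_KEY` are hypothetical; only the file name Azure_API_key.txt comes from this repository.

```python
import os
from pathlib import Path


def load_azure_key(key_file="Azure_API_key.txt", env_var="AZURE_TTS_KEY"):
    """Return the API key from the environment if set, else from the key file.

    Both the default env_var name and this function are illustrative
    assumptions; adapt them to how your own setup stores the key.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(key_file)
    if path.is_file():
        # Strip the trailing newline that most editors append.
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key
    raise RuntimeError(f"No Azure key found in ${env_var} or {key_file}")
```

Checking the environment first lets you keep the key out of the working tree on shared machines, while the file remains a fallback for local runs.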
prosodyControl/
│
├── Code/
│ ├── audioPipeline.py # Main SSML pipeline
│ ├── audioPipeline_legacy.py # Legacy pipeline scripts
│ ├── pipeline_class_legacy.py # Legacy pipeline class
│ ├── prepare_AB_test.py # AB test preparation script
│ ├── Aligners/ # Alignment tools (Whisper, MFA, etc.)
│ ├── Pipeline/ # Prosody extraction and processing modules
│ ├── Preprocessing/ # Audio and data preprocessing scripts
│ ├── baseline_models/ # Baseline BERT and BiLSTM models
│ └── ssml_models/ # Zero-shot, few-shot, and cascaded LLM models
│
├── Data/
│ └── voice/
│ └── records/
│ └── audio/ # Example segmented audio files
│
├── config.yaml # Main configuration file for the pipeline
├── tts-env.yml # Conda environment specification
├── Azure_API_key.txt # Use environment variables instead
└── README.md # This file
- Code/audioPipeline.py: The main entry point for the SSML annotation pipeline. All processing steps are managed here.
- Code/Aligners/, Code/Pipeline/, Code/Preprocessing/: Contain scripts for alignment, prosody extraction, and preprocessing, used as part of the pipeline.
- Code/baseline_models/: Implements the BERT and BiLSTM baselines referenced in the paper.
- Code/ssml_models/: Contains our zero-shot, few-shot, and cascaded LLM approaches for SSML tag prediction.
- Data/voice/records/audio/: Example segmented audio files for demonstration and testing.
All pipeline settings are controlled via config.yaml. This includes data paths, voice names, Azure TTS settings, prosody parameters, and which steps to run.
To run the full SSML annotation pipeline:
    conda activate tts-env
    python Code/audioPipeline.py
- Adjust config.yaml as needed for your data and experiment.
- The pipeline will process all voices specified in voice_names and execute the steps listed in steps_to_run.
- Intermediate and final outputs (e.g., SSML, audio, CSVs) will be saved according to your configuration.
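The voices-by-steps behaviour described above can be pictured as a simple nested loop. The sketch below assumes the configuration has already been loaded into a dictionary. It is not the actual control flow of audioPipeline.py: the function `run_pipeline`, the step names `align` and `synthesize`, and the voice names are all hypothetical. Only the keys `voice_names` and `steps_to_run` come from config.yaml.

```python
def run_pipeline(config, step_registry):
    """Run each configured step for each configured voice, in order."""
    # Fail early if the config names a step that has no implementation.
    unknown = [s for s in config["steps_to_run"] if s not in step_registry]
    if unknown:
        raise ValueError(f"Unknown steps in steps_to_run: {unknown}")
    log = []
    for voice in config["voice_names"]:
        for step in config["steps_to_run"]:
            step_registry[step](voice, config)
            log.append((voice, step))
    return log


# Hypothetical step functions and configuration values, for illustration only.
def align(voice, config):
    print(f"[{voice}] aligning audio and transcript")


def synthesize(voice, config):
    print(f"[{voice}] synthesizing speech from SSML")


config = {
    "voice_names": ["voice_a", "voice_b"],
    "steps_to_run": ["align", "synthesize"],
}
log = run_pipeline(config, {"align": align, "synthesize": synthesize})
```

Validating the step names before the loop starts means a typo in steps_to_run is reported immediately, not after the first voice has already been processed.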
- Baselines: See Code/baseline_models/ for BERT and BiLSTM models for pause and prosody prediction.
- LLM Approaches: See Code/ssml_models/ for zero-shot, few-shot, and cascaded Qwen-based models for SSML tag generation.
All models and scripts are referenced in the paper and can be used or extended for further research.
The paper is available in the ACL Anthology:
Improving French Synthetic Speech Quality via SSML Prosody Control (https://aclanthology.org/2025.icnlsp-1.30/)
If you use this code or these models, please cite the paper.
@inproceedings{ouali-etal-2025-improving,
title = "Improving {F}rench Synthetic Speech Quality via {SSML} Prosody Control",
author = "Ouali, Nassima Ould and
Sani, Awais Hussain and
Bueno, Ruben and
Dauvet, Jonah and
Horstmann, Tim Luka and
Moulines, Eric",
editor = "Abbas, Mourad and
Yousef, Tariq and
Galke, Lukas",
booktitle = "Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)",
month = aug,
year = "2025",
address = "Southern Denmark University, Odense, Denmark",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.icnlsp-1.30/",
pages = "302--314"
}
⭐ Don't forget to star this repo if you find it useful!
This project is licensed under the MIT License. See the LICENSE file for details.