Improving Synthetic Speech Quality via SSML Prosody Control

📝 Abstract

This repository contains the code and models for the paper:

"Improving French Synthetic Speech Quality via SSML Prosody Control"

We present a novel, end-to-end pipeline for enhancing the prosody of French synthetic speech using SSML (Speech Synthesis Markup Language) tags. Our approach leverages both supervised and large language model (LLM) methods to automatically annotate text with prosodic cues (pitch, volume, rate, and pauses), significantly improving the naturalness and expressiveness of TTS output.
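To make the target annotation concrete, the sketch below builds a small SSML fragment carrying the four cue types named above. It is an illustration only, not the repository's code: the helper name `annotate` and all cue values are hypothetical, while the `prosody` and `break` elements and their `pitch`, `rate`, `volume`, and `time` attributes come from the W3C SSML specification.

```python
import xml.etree.ElementTree as ET


def annotate(segments, lang="fr-FR"):
    """Build an SSML string from (text, cues, pause_ms) segments.

    Hypothetical helper for illustration. `cues` maps SSML prosody
    attributes (pitch, rate, volume) to values; `pause_ms` is the pause
    inserted after the segment (0 means no pause).
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": lang})
    for text, cues, pause_ms in segments:
        # Wrap each text segment in its own <prosody> element.
        prosody = ET.SubElement(speak, "prosody", cues)
        prosody.text = text
        if pause_ms > 0:
            # Pauses are expressed as sibling <break> elements.
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


# Illustrative values only; they are not taken from the paper.
ssml = annotate([
    ("Bonjour à tous,", {"pitch": "+5%", "rate": "slow", "volume": "+2dB"}, 300),
    ("bienvenue.", {"pitch": "-2%", "rate": "medium"}, 0),
])
print(ssml)
```

The pipeline and models in this repository predict such cues automatically; here they are written by hand to show the shape of the output.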

Overview

Despite advances in TTS, synthetic French voices often lack natural prosody, especially in expressive contexts. This project provides:

🎵 SSML Annotation Pipeline (audioPipeline.py) for French speech
📊 Baseline Models (BERT, BiLSTM) for prosody and break prediction
🧠 LLM-based Models (zero-shot, few-shot, and cascaded Qwen)
📁 Example data and configuration for reproducible experiments

⚡ Installation

We recommend using Ubuntu 22.04.3 or similar for best compatibility.

Clone the repository:

git clone https://github.com/hi-paris/Prosody-Control-French-TTS

Create the conda environment:

conda env create -f tts-env.yml
conda activate tts-env

Download the required tools:
- Download the .rar archive from Google Drive
- Place it in a folder named Tools at the root of the repository (prosodyControl/Tools/)

Add your Azure TTS API key:
- Create a file named Azure_API_key.txt at the root of the repository
- Paste your Azure API key into this file
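
Since the project tree also mentions environment variables for the key, one way to support both options is to check an environment variable first and fall back to the key file. This is a sketch under assumptions, not the repository's loader: the function name `load_azure_key` and the variable name `AZURE_TTS_KEY` are hypothetical; only the file name Azure_API_key.txt comes from this README.

```python
import os
from pathlib import Path


def load_azure_key(key_file="Azure_API_key.txt", env_var="AZURE_TTS_KEY"):
    """Return the Azure key from the environment, else from the key file.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    # Prefer the environment so the key never has to live in the repo.
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to the plain-text key file at the given path.
    path = Path(key_file)
    if path.is_file():
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key
    raise RuntimeError(f"No Azure key found: set {env_var} or create {key_file}")
```

If you keep the key in a file, adding that file to .gitignore helps avoid committing it by accident.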

Project Structure

prosodyControl/
│
├── Code/
│   ├── audioPipeline.py           # Main SSML pipeline
│   ├── audioPipeline_legacy.py    # Legacy pipeline scripts
│   ├── pipeline_class_legacy.py   # Legacy pipeline class
│   ├── prepare_AB_test.py         # AB test preparation script
│   ├── Aligners/                  # Alignment tools (Whisper, MFA, etc.)
│   ├── Pipeline/                  # Prosody extraction and processing modules
│   ├── Preprocessing/             # Audio and data preprocessing scripts
│   ├── baseline_models/           # Baseline BERT and BiLSTM models
│   └── ssml_models/               # Zero-shot, few-shot, and cascaded LLM models
│
├── Data/
│   └── voice/
│       └── records/
│           └── audio/             # Example segmented audio files
│
├── config.yaml                    # Main configuration file for the pipeline
├── tts-env.yml                    # Conda environment specification
├── Azure_API_key.txt              # Azure TTS API key (consider using environment variables instead)
└── README.md                      # This file

Code/audioPipeline.py: The main entry point for the SSML annotation pipeline. All processing steps are managed here.
Code/Aligners/, Code/Pipeline/, Code/Preprocessing/: Contain scripts for alignment, prosody extraction, and preprocessing, used as part of the pipeline.
Code/baseline_models/: Implements the BERT and BiLSTM baselines referenced in the paper.
Code/ssml_models/: Contains our zero-shot, few-shot, and cascaded LLM approaches for SSML tag prediction.
Data/voice/records/audio/: Example segmented audio files for demonstration and testing.

🎮 Usage

All pipeline settings are controlled via config.yaml. This includes data paths, voice names, Azure TTS settings, prosody parameters, and which steps to run.

To run the full SSML annotation pipeline:

conda activate tts-env
python Code/audioPipeline.py

Adjust config.yaml as needed for your data and experiment.
The pipeline will process all voices specified in voice_names and execute the steps listed in steps_to_run.
Intermediate and final outputs (e.g., SSML, audio, CSVs) will be saved according to your configuration.
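
The relationship between voice_names and steps_to_run can be pictured as two nested loops: every listed step runs for every listed voice. The sketch below is hypothetical: the step names, the `STEPS` registry, and the step functions are invented for illustration, and the `config` dictionary stands in for whatever a YAML parser would return for config.yaml. The actual keys and steps in the repository may differ.

```python
def align(voice):
    # Placeholder step: a real step would align audio and transcript.
    return f"{voice}: aligned"


def extract_prosody(voice):
    # Placeholder step: a real step would measure pitch, volume, and rate.
    return f"{voice}: prosody extracted"


def generate_ssml(voice):
    # Placeholder step: a real step would write annotated SSML.
    return f"{voice}: SSML generated"


# Hypothetical registry mapping step names to functions.
STEPS = {
    "align": align,
    "extract_prosody": extract_prosody,
    "generate_ssml": generate_ssml,
}


def run_pipeline(config):
    """Run each requested step for each voice, in the configured order."""
    log = []
    for voice in config["voice_names"]:
        for step in config["steps_to_run"]:
            if step not in STEPS:
                raise KeyError(f"Unknown step in steps_to_run: {step!r}")
            log.append(STEPS[step](voice))
    return log


# Stand-in for the parsed contents of config.yaml.
config = {
    "voice_names": ["voice_a", "voice_b"],
    "steps_to_run": ["align", "generate_ssml"],
}
for line in run_pipeline(config):
    print(line)
```

Failing fast on an unknown step name surfaces typos in the configuration before any processing starts.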

🤖 Models

Baselines: See Code/baseline_models/ for BERT and BiLSTM models for pause and prosody prediction.
LLM Approaches: See Code/ssml_models/ for zero-shot, few-shot, and cascaded Qwen-based models for SSML tag generation.

All models and scripts are referenced in the paper and can be used or extended for further research.
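
As an illustration of the few-shot setting, a prompt can pair a handful of plain sentences with their SSML-annotated versions before presenting the sentence to annotate. The sketch below only assembles such a prompt as a string and calls no model. The function name, the instruction wording, and the example pair are hypothetical and are not taken from Code/ssml_models/ or from the paper.

```python
def build_few_shot_prompt(examples, sentence):
    """Assemble a few-shot prompt from (plain_text, ssml) example pairs."""
    parts = [
        "Annotate the French sentence with SSML prosody and break tags.",
        "",
    ]
    # Each example shows the model one input/output pair.
    for plain, ssml in examples:
        parts += [f"Text: {plain}", f"SSML: {ssml}", ""]
    # The final entry leaves the SSML field open for the model to complete.
    parts += [f"Text: {sentence}", "SSML:"]
    return "\n".join(parts)


# Hypothetical example pair written for this sketch.
examples = [
    (
        "Bonjour, comment allez-vous ?",
        '<prosody rate="slow">Bonjour,</prosody><break time="200ms"/> comment allez-vous ?',
    ),
]
prompt = build_few_shot_prompt(examples, "Merci beaucoup.")
print(prompt)
```

Passing an empty `examples` list yields the zero-shot variant of the same prompt.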

📚 Citation

The paper is available in the ACL Anthology:

Improving French Synthetic Speech Quality via SSML Prosody Control

If you use this code or these models, please cite the paper.

@inproceedings{ouali-etal-2025-improving,
    title = "Improving {F}rench Synthetic Speech Quality via {SSML} Prosody Control",
    author = "Ouali, Nassima Ould  and
      Sani, Awais Hussain  and
      Bueno, Ruben  and
      Dauvet, Jonah  and
      Horstmann, Tim Luka  and
      Moulines, Eric",
    editor = "Abbas, Mourad  and
      Yousef, Tariq  and
      Galke, Lukas",
    booktitle = "Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)",
    month = aug,
    year = "2025",
    address = "Southern Denmark University, Odense, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.icnlsp-1.30/",
    pages = "302--314"
}

⭐ Don't forget to star this repo if you find it useful!

License

This project is licensed under the MIT License. See the LICENSE file for details.

📬 Contact

Nassima Ould-Ouali
