Multi-language capabilities for MatchaTTS#166

Open
fzalkow wants to merge 1 commit into shivammehta25:main from fzalkow:main
Conversation


@fzalkow fzalkow commented Dec 19, 2025

This pull request adds multi-language support to MatchaTTS by concatenating language embeddings to the encoder and decoder inputs.
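As a rough illustration of the conditioning mechanism described above, here is a minimal numpy sketch (not the PR's actual PyTorch code; table sizes, dimensions, and function names are made up for the example) of looking up speaker and language embeddings, L2-normalizing them, and concatenating them to every frame of an encoder input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables; the real model learns these.
n_spks, n_langs, emb_dim = 4, 2, 8
spk_table = rng.normal(size=(n_spks, emb_dim))
lang_table = rng.normal(size=(n_langs, emb_dim))

def l2_normalize(v, eps=1e-8):
    """L2-normalize an embedding vector."""
    return v / (np.linalg.norm(v) + eps)

def condition_input(x, spk_id, lang_id):
    """Concatenate speaker and language embeddings to every frame of x.

    x: (T, d) sequence of encoder (or decoder) input features.
    Returns an array of shape (T, d + 2 * emb_dim).
    """
    spk = l2_normalize(spk_table[spk_id])
    lang = l2_normalize(lang_table[lang_id])
    cond = np.concatenate([spk, lang])                  # (2 * emb_dim,)
    cond = np.broadcast_to(cond, (x.shape[0], cond.size))
    return np.concatenate([x, cond], axis=1)

x = rng.normal(size=(10, 16))                           # 10 frames, 16-dim features
y = condition_input(x, spk_id=1, lang_id=0)
print(y.shape)  # (10, 32)
```

The same conditioning vector is repeated across time, so the model sees the speaker and language identity at every frame.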

Speaker and Language Disentanglement:
When training with only a few monolingual speakers, speaker and language IDs are highly correlated, so it is likely (though not tested) that the model will have difficulty disentangling them. With multilingual speakers, or a sufficiently large number of speakers, the model can better separate speaker and language information.

How to Use:
Specify the number of languages using the n_langs key in the data config and annotate your data CSV as follows:

filepath spk lang text
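For concreteness, a hypothetical setup might look like this (the `|` delimiter and the exact key names besides `n_langs` are assumptions based on Matcha-TTS filelist conventions, not taken from the PR):

```yaml
# data config fragment: only the n_langs key is new in this PR
n_spks: 2
n_langs: 2
```

```
wavs/de_0001.wav|0|0|Guten Morgen.
wavs/en_0001.wav|1|1|Good morning.
```

Here the second and third columns are integer speaker and language IDs in `[0, n_spks)` and `[0, n_langs)` respectively.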

Backward Compatibility:
If you set n_langs: 1, the system behaves as before this pull request, with one exception: both speaker and language embeddings are now L2-normalized (if used). In my experience, this helps maintain a balance between speaker and language information during training.
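The balance point above can be seen in a toy example: without normalization, one embedding table can dominate simply by growing larger in magnitude during training, while L2 normalization forces both to contribute with unit magnitude (a sketch, not the PR's code):

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

# A raw speaker embedding that has grown large during training,
# next to a much smaller language embedding.
spk_raw = np.array([30.0, 40.0])   # norm 50
lang_raw = np.array([0.3, 0.4])    # norm 0.5

spk = l2_normalize(spk_raw)
lang = l2_normalize(lang_raw)
print(np.linalg.norm(spk), np.linalg.norm(lang))  # both ~1.0
```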

Note:
This pull request does not include any text processing functions for additional languages. You will need to ensure that your text preprocessing pipeline supports the languages you intend to use.

@scott-parkhill

Would this PR allow us to train a multi-lingual model on typologically similar languages to potentially reduce the amount of training data we would need for these languages? I.e. if we have a lot of training data for one language in a family, and there is a typologically similar language for which we do not have much data, could the contents of this PR allow us to train a voice leveraging the model trained with the larger dataset?

@fzalkow
Author

fzalkow commented Mar 11, 2026

> Would this PR allow us to train a multi-lingual model on typologically similar languages to potentially reduce the amount of training data we would need for these languages? I.e. if we have a lot of training data for one language in a family, and there is a typologically similar language for which we do not have much data, could the contents of this PR allow us to train a voice leveraging the model trained with the larger dataset?

I have not tried this with this particular PR, but I have with other models using similar techniques, and there was a cross-language benefit. So my guess is that the answer to your question is yes.
