intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md

The Audio Processing and Text-to-Speech (TTS) Plugin is a software component designed to provide audio processing and Text-to-Speech capabilities:

- **Text-to-Speech (TTS) Conversion**: The TTS plugin can convert written text into natural-sounding speech by synthesizing human-like voices. Users can customize the voice, tone, and speed of the generated speech to suit their specific requirements.
- **Audio Speech Recognition (ASR)**: The ASR plugin supports speech recognition, allowing it to transcribe spoken words into text. This can be used for applications like voice commands, transcription services, and voice-controlled interfaces. It supports English, Chinese, and the other languages covered by Whisper.
- **Multi-Language Support**: The plugins support multiple languages and accents, making them versatile for global applications and diverse user bases. The ASR plugin supports the tens of languages that the Whisper model covers; the TTS plugin currently supports English, Chinese, and Japanese.
- **Integration**: Developers can easily integrate these plugins into their applications or systems using APIs.

For other operating systems such as CentOS, you will need to make slight adjustments.
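For example, Debian-style `apt-get` commands become `yum` or `dnf` on CentOS. A minimal sketch of the substitution (the `ffmpeg` package is illustrative, not the verbatim list from the install steps):

```bash
# Ubuntu / Debian
sudo apt-get install -y ffmpeg
# CentOS / RHEL equivalent
sudo yum install -y ffmpeg
```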

# Multilingual Automatic Speech Recognition (ASR)

We support multi-language Automatic Speech Recognition using Whisper.

## Usage

The `AudioSpeechRecognition` class provides functionality for converting multi-language audio to text. Here's how to use it:

```python
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.asr import AudioSpeechRecognition

# Pass language="auto" to let the ASR model detect the language automatically;
# otherwise, pass a specific language code (e.g. en/zh/de/fr...)
asr = AudioSpeechRecognition("openai/whisper-small", language="auto", device="cpu")
audio_path = "~/audio.wav"  # Replace with the path to your audio file (supports MP3 and WAV)
result = asr.audio2text(audio_path)
print("ASR Result:", result)
```
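If the spoken language is known in advance, you can pin it instead of relying on detection. A minimal sketch reusing the same class (the input file is a placeholder; language codes follow Whisper's conventions):

```python
# Force Chinese transcription instead of auto-detection (hypothetical input file).
asr_zh = AudioSpeechRecognition("openai/whisper-small", language="zh", device="cpu")
print("ASR Result:", asr_zh.audio2text("~/audio_zh.wav"))
```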

# English Text-to-Speech (TTS)

We support English-only TTS based on [SpeechT5](https://arxiv.org/pdf/2110.07205.pdf), with checkpoints downloaded directly from [HuggingFace](https://huggingface.co/microsoft/speecht5_tts). It is a two-stage TTS model composed of an acoustic model and a vocoder, and it uses a speaker embedding to distinguish between voices. In our early experiments and development, this model with the pretrained weights produced relatively good English-only audio and could clone voices given only a few audio samples from a new speaker.
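To make the two-stage design concrete, here is a minimal sketch of SpeechT5 inference through the plain Hugging Face `transformers` API (this is not the plugin's own interface, and the random speaker embedding is only a stand-in for a real 512-dimensional x-vector):

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")  # acoustic model
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")    # vocoder

inputs = processor(text="Hello, this is a test.", return_tensors="pt")
speaker_embedding = torch.randn(1, 512)  # placeholder; load a real x-vector to select a voice
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("speecht5_sample.wav", speech.numpy(), samplerate=16000)
```

Swapping in x-vectors extracted from a new speaker's recordings is what enables the few-shot voice cloning described above.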
## Dependencies Installation
To use the English TTS module, you need to install the required dependencies. Run the following command:
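The exact package list is not captured in this excerpt; for SpeechT5 through Hugging Face `transformers`, it is typically along these lines (treat this as an assumption, not the verbatim command):

```bash
pip install transformers soundfile
```

A usage sketch under stated assumptions: the module path `intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.tts` and the `TextToSpeech` class name follow the plugin layout but are not confirmed by this excerpt; only the `voice` line is taken from the original example.

```python
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.tts import TextToSpeech  # assumed path

tts = TextToSpeech()
text_to_speak = "Hello there, how are you today?"  # Replace with your English text
output_audio_path = "./response.wav"  # Replace with your desired output audio path
voice = "default"  # You can choose between "default," "pat," or a custom voice
tts.text2speech(text_to_speak, output_audio_path, voice)  # assumed signature
```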

# Multilingual Text-to-Speech (TTS)

We support multi-language, multi-speaker text-to-speech (Chinese, English, Japanese) on top of the [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) project, with [IPEX](https://github.com/intel/intel-extension-for-pytorch) BFloat16 inference optimization on Xeon CPUs. We finetune our [checkpoints](https://huggingface.co/spycsh/bert-vits-thchs-6-8000) on partial data (6 speakers) from the [THCHS-30](https://www.openslr.org/18/) audio dataset. The backbone is [VITS](https://arxiv.org/pdf/2106.06103.pdf), itself an end-to-end TTS model; combined with a BERT text encoder, VITS has been shown to fuse richer latent text features with audio and deliver high-quality TTS in multiple speakers' voices.
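As an illustration of what the BFloat16 path involves, here is a sketch using plain IPEX APIs (not the plugin's internal code; the small `nn.Sequential` stands in for the actual TTS network):

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

# Stand-in for the TTS network; any inference-mode nn.Module is treated the same way.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80)).eval()

# Let IPEX prepare the model for BFloat16 inference on Xeon.
model = ipex.optimize(model, dtype=torch.bfloat16)

# Run under CPU autocast so supported ops execute in BFloat16.
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    mel = model(torch.randn(1, 256))
print(mel.dtype)  # torch.bfloat16
```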
## Usage

The `MultilangTextToSpeech` class provides the TTS functionality. Here's how to use it:

```python
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.tts_multilang import MultilangTextToSpeech

# Initialize the TTS module
tts = MultilangTextToSpeech()

# Define the text you want to convert to speech
text_to_speak = "欢迎来到英特尔，welcome to Intel。こんにちは！"  # Replace with your multi-language text

# Specify the output audio path
output_audio_path = "./output.wav"  # Replace with your desired output audio path

# Perform text-to-speech conversion
tts.text2speech(text_to_speak, output_audio_path)

# To change the speaker, change the sid (speaker id)
```
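To make the last comment concrete: selecting one of the six finetuned THCHS-30 speakers would look roughly like the call below (the `sid` keyword argument is inferred from the comment, so verify it against the `text2speech` signature):

```python
# Hypothetical: same text, different speaker id.
tts.text2speech(text_to_speak, output_audio_path, sid=1)
```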