Add 'audio-text-to-text' task to Hugging Face Tasks #1479

base: main

Conversation
This commit introduces the new 'audio-text-to-text' task. The following has been added:

- Directory structure for the task under `packages/tasks/src/tasks/audio-text-to-text/`.
- `about.md` with the task description, use cases, and Python/JS examples.
- `data.ts` with metadata including example datasets, models, metrics, and demo definitions.
- `spec/input.json` and `spec/output.json` defining the task's input and output schemas.

The main task registration file `packages/tasks/src/tasks/index.ts` has been updated to:

- Import and include the 'audio-text-to-text' task data.
- List relevant model libraries (`transformers`, `speechbrain`, `espnet`, `nemo`) for this task type.

This task covers functionalities like automatic speech recognition (ASR) and speech translation.
Thanks a lot for the PR @ProCreations-Official! In general it'd be good to add more relevant info for models that are actually capable of the audio-text-to-text task, so more relevant models would be: Ultravox, Phi4, Qwen Audio, etc.

cc: @Deep-unlearning
Thanks a lot for working on this! Left very general comments 🙂
@@ -0,0 +1,123 @@
## Audio Text to Text
## Audio Text to Text
no need for this
@@ -0,0 +1,123 @@
## Audio Text to Text

The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.
This should be in the `data.ts` summary part.
### Use Cases
-### Use Cases
+## Use Cases
* **Speech Recognition:** Transcribing spoken language from an audio clip into text. This is foundational for voice assistants, dictation software, and transcribing meetings or interviews.
These can be separate headers instead of bullet points.
* **Voice Command Interfaces:** Converting spoken commands into text that can then be interpreted by a system to perform actions (e.g., "Play music," "Set a timer").
* **Audio Event Description/Captioning:** Generating textual descriptions of sounds or events occurring in an audio stream (though this might sometimes overlap with Audio Tagging).
### Python Examples
-### Python Examples
+## Inference
+**Transformers**
spaces: [
	{
		description: "A demonstration of the Whisper model for speech recognition.",
		id: "openai/whisper",
This is ASR, not audio-text-to-text
	},
	{
		description: "An ESPnet demo that can perform speech recognition and translation.",
		id: "espnet/espnet_asr_demo",
This is ASR, not audio-text-to-text
	},
	{
		description: "A model for translating speech from English to German (example of a speech translation model).",
		id: "facebook/s2t-medium-en-de-st",
this is speech-to-text, not audio-text-to-text
models: [
	{
		description: "A popular multilingual model for automatic speech recognition.",
		id: "openai/whisper-base",
This is ASR, not audio-text-to-text
@@ -119,7 +120,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
	"audio-classification": ["speechbrain", "transformers", "transformers.js"],
	"audio-to-audio": ["asteroid", "fairseq", "speechbrain"],
	"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
-	"audio-text-to-text": [],
+	"audio-text-to-text": ["transformers", "speechbrain", "espnet", "nemo"],
I think only transformers supports it
# Initialize the ASR pipeline
# Replace "openai/whisper-base" with any ASR model of your choice
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")
For Automatic Speech Recognition you can use QwenAudio or Granite Speech which are audio-text-to-text models
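As a sketch of what the reviewer is pointing at: audio-text-to-text models such as Qwen2-Audio are prompted with a chat-style multimodal message (an audio part plus a text instruction) rather than a bare ASR pipeline call. The snippet below only builds that message structure; the helper name and audio URL are illustrative assumptions, and model/processor loading is deliberately omitted.

```python
# Sketch: chat-style prompt for an audio-text-to-text model such as
# Qwen2-Audio. The helper name and the URL are hypothetical; only the
# message structure (audio part + text instruction) is the point here.
def build_audio_text_prompt(audio_url: str, instruction: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_url},
                {"type": "text", "text": instruction},
            ],
        }
    ]

# For transcription, the text part carries the instruction:
messages = build_audio_text_prompt(
    "https://example.com/sample.wav",  # hypothetical audio file
    "Transcribe the speech.",
)
```

In practice this message list would then be fed to the model's processor (e.g. via its chat template) together with the decoded audio, which is what distinguishes these models from plain `automatic-speech-recognition` pipelines.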
# Initialize the speech-to-text translation pipeline
# Replace "facebook/s2t-small-librispeech-asr" with a speech translation model
# For example, if you want to translate English audio to French text:
translator_pipeline = pipeline("automatic-speech-recognition", model="facebook/s2t-small-en-fr-st")  # Example model
Same for translation Granite Speech support translation En -> X
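A hedged sketch of the translation case: with an instruction-following audio-text-to-text model (e.g. Granite Speech for En -> X), translation is expressed as a different text instruction on the same kind of multimodal message, not a separate pipeline. The field values below are assumptions for illustration, not tested against any specific model.

```python
# Hypothetical instruction-style request for speech translation with an
# audio-text-to-text model; the audio URL and prompt wording are
# illustrative, only the shape of the message matters.
translation_request = {
    "role": "user",
    "content": [
        {"type": "audio", "audio_url": "https://example.com/english.wav"},
        {"type": "text", "text": "Translate the speech to French."},
    ],
}
```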
Co-authored-by: Steven Zheng <[email protected]>