
Add 'audio-text-to-text' task to Hugging Face Tasks #1479


Open · wants to merge 4 commits into base: main

Conversation

ProCreations-Official

This commit introduces the new 'audio-text-to-text' task.

The following has been added:
- Directory structure for the task under `packages/tasks/src/tasks/audio-text-to-text/`.
- `about.md` with task description, use cases, and Python/JS examples.
- `data.ts` with metadata including example datasets, models, metrics, and demo definitions.
- `spec/input.json` and `spec/output.json` defining the task's input and output schema.

The main task registration file `packages/tasks/src/tasks/index.ts` has been updated to:
- Import and include the 'audio-text-to-text' task data.
- List relevant model libraries (`transformers`, `speechbrain`, `espnet`, `nemo`) for this task type.

This task covers functionalities like automatic speech recognition (ASR) and speech translation.
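For a concrete sense of the schema, here is a hypothetical request/response pair for this task. The real field names live in `spec/input.json` and `spec/output.json`, which aren't reproduced in this thread, so treat every key below as an assumption:

```python
# Hypothetical payload shape for audio-text-to-text; the actual schema
# is whatever spec/input.json and spec/output.json define.
request = {
    "inputs": {
        "audio": "https://example.com/sample.flac",  # audio clip (URL or base64)
        "text": "Transcribe this recording.",        # accompanying text prompt
    },
    "parameters": {"max_new_tokens": 128},
}
response = [{"generated_text": "Hello, and welcome to the show."}]
```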
Member

@Vaibhavs10 Vaibhavs10 left a comment


Thanks a lot for the PR @ProCreations-Official - in general it'd be good to feature models that are actually capable of the audio-text-to-text task, so more relevant models would be:

Ultravox, Phi4, Qwen Audio, etc.

cc: @Deep-unlearning
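For reference, candidate models for the tag can be listed straight from the Hub. A minimal sketch, assuming `huggingface_hub`'s `task` filter applies to the new pipeline tag:

```python
from huggingface_hub import HfApi

# Print the most-downloaded models carrying the audio-text-to-text tag.
for model in HfApi().list_models(task="audio-text-to-text", sort="downloads", limit=5):
    print(model.id)
```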

Contributor

@merveenoyan merveenoyan left a comment


Thanks a lot for working on this! Left very general comments 🙂

@@ -0,0 +1,123 @@
## Audio Text to Text
Contributor


Suggested change
## Audio Text to Text

no need for this

@@ -0,0 +1,123 @@
## Audio Text to Text

The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.
Contributor


this should go in the summary field of `data.ts`


The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.

### Use Cases
Contributor


Suggested change
### Use Cases
## Use Cases


### Use Cases

* **Speech Recognition:** Transcribing spoken language from an audio clip into text. This is foundational for voice assistants, dictation software, and transcribing meetings or interviews.
Contributor


these can be separate headers instead of bullet points

* **Voice Command Interfaces:** Converting spoken commands into text that can then be interpreted by a system to perform actions (e.g., "Play music," "Set a timer").
* **Audio Event Description/Captioning:** Generating textual descriptions of sounds or events occurring in an audio stream (though this might sometimes overlap with Audio Tagging).

### Python Examples
Contributor


Suggested change
### Python Examples
## Inference
**Transformers**

spaces: [
    {
        description: "A demonstration of the Whisper model for speech recognition.",
        id: "openai/whisper",
Contributor


This is ASR, not audio-text-to-text

    },
    {
        description: "An ESPnet demo that can perform speech recognition and translation.",
        id: "espnet/espnet_asr_demo",
Contributor


This is ASR, not audio-text-to-text

    },
    {
        description: "A model for translating speech from English to German (example of a speech translation model).",
        id: "facebook/s2t-medium-en-de-st",
Contributor


this is speech-to-text, not audio-text-to-text

models: [
    {
        description: "A popular multilingual model for automatic speech recognition.",
        id: "openai/whisper-base",
Contributor


This is ASR, not audio-text-to-text

@@ -119,7 +120,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
"audio-classification": ["speechbrain", "transformers", "transformers.js"],
"audio-to-audio": ["asteroid", "fairseq", "speechbrain"],
"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
"audio-text-to-text": [],
"audio-text-to-text": ["transformers", "speechbrain", "espnet", "nemo"],
Contributor


I think only transformers supports it
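If so, the registration would presumably shrink to just `"audio-text-to-text": ["transformers"],`.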


# Initialize the ASR pipeline
# Replace "openai/whisper-base" with any ASR model of your choice
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")
Contributor


For Automatic Speech Recognition you can use QwenAudio or Granite Speech, which are audio-text-to-text models.
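For illustration, transcription with an audio-text-to-text model such as Qwen2-Audio looks roughly like this. This is a sketch adapted from the usual chat-template pattern; the model id, the `audios` keyword, and the prompt wording are assumptions to verify against the model card:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"  # assumed id; check the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id)

# Pair an audio clip with a text instruction, chat-style.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "sample.wav"},
            {"type": "text", "text": "Transcribe this recording."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```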

# Initialize the speech-to-text translation pipeline
# Replace the model id below with a speech translation model of your choice
# For example, to translate English audio to French text:
translator_pipeline = pipeline("automatic-speech-recognition", model="facebook/s2t-small-en-fr-st")  # Example model
Contributor


Same for translation: Granite Speech supports translation En -> X.
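Following the transcription sketch above, translation would then be just a prompt change (again an assumption; the exact instruction format depends on the model):

```python
# Same processor/model as the transcription sketch; only the prompt changes.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "sample_en.wav"},
            {"type": "text", "text": "Translate this English speech into French."},
        ],
    }
]
# ...then apply_chat_template / processor / generate exactly as before.
```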
