
Add 'audio-text-to-text' task to Hugging Face Tasks #1479


Open · wants to merge 4 commits into base: main

Conversation

ProCreations-Official

This commit introduces the new 'audio-text-to-text' task.

The following has been added:
- Directory structure for the task under `packages/tasks/src/tasks/audio-text-to-text/`.
- `about.md` with task description, use cases, and Python/JS examples.
- `data.ts` with metadata including example datasets, models, metrics, and demo definitions.
- `spec/input.json` and `spec/output.json` defining the task's input and output schema.

The main task registration file `packages/tasks/src/tasks/index.ts` has been updated to:
- Import and include the 'audio-text-to-text' task data.
- List relevant model libraries (`transformers`, `speechbrain`, `espnet`, `nemo`) for this task type.

This task covers functionalities like automatic speech recognition (ASR) and speech translation.
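For a concrete sense of the schema, here is a hypothetical request/response pair for this task. The real field names live in `spec/input.json` and `spec/output.json`, which aren't reproduced in this thread, so treat every key below as an assumption:

```python
# Hypothetical payload shape for audio-text-to-text; the actual schema
# is whatever spec/input.json and spec/output.json define.
request = {
    "inputs": {
        "audio": "https://example.com/sample.flac",  # audio clip (URL or base64)
        "text": "Transcribe this recording.",        # accompanying text prompt
    },
    "parameters": {"max_new_tokens": 128},
}
response = [{"generated_text": "Hello, and welcome to the show."}]
```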
Member

@Vaibhavs10 Vaibhavs10 left a comment


Thanks a lot for the PR @ProCreations-Official - in general it'd be good to feature models that are actually capable of the audio-text-to-text task, so more relevant models would be:

Ultravox, Phi4, Qwen Audio, etc.

cc: @Deep-unlearning
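For reference, candidate models for the tag can be listed straight from the Hub. A minimal sketch, assuming `huggingface_hub`'s `task` filter applies to the new pipeline tag:

```python
from huggingface_hub import HfApi

# Print the most-downloaded models carrying the audio-text-to-text tag.
for model in HfApi().list_models(task="audio-text-to-text", sort="downloads", limit=5):
    print(model.id)
```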

Contributor

@merveenoyan merveenoyan left a comment


Thanks a lot for working on this! Left very general comments 🙂

@@ -0,0 +1,123 @@
## Audio Text to Text
Contributor


Suggested change
## Audio Text to Text

no need for this

@@ -0,0 +1,123 @@
## Audio Text to Text

The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.
Contributor


this should go in the summary field of `data.ts`


The Audio Text to Text task (also sometimes referred to as speech-to-text, speech recognition, or speech translation depending on the specifics) converts audio input into textual output. This is a versatile task that can be used for various applications.

### Use Cases
Contributor


Suggested change
### Use Cases
## Use Cases


### Use Cases

* **Speech Recognition:** Transcribing spoken language from an audio clip into text. This is foundational for voice assistants, dictation software, and transcribing meetings or interviews.
Contributor


these can be separate headers instead of bullet points

* **Voice Command Interfaces:** Converting spoken commands into text that can then be interpreted by a system to perform actions (e.g., "Play music," "Set a timer").
* **Audio Event Description/Captioning:** Generating textual descriptions of sounds or events occurring in an audio stream (though this might sometimes overlap with Audio Tagging).

### Python Examples
Contributor


Suggested change
### Python Examples
## Inference
**Transformers**

spaces: [
    {
        description: "A demonstration of the Whisper model for speech recognition.",
        id: "openai/whisper",
Contributor


This is ASR, not audio-text-to-text

    },
    {
        description: "An ESPnet demo that can perform speech recognition and translation.",
        id: "espnet/espnet_asr_demo",
Contributor


This is ASR, not audio-text-to-text

    },
    {
        description: "A model for translating speech from English to German (example of a speech translation model).",
        id: "facebook/s2t-medium-en-de-st",
Contributor


this is speech-to-text, not audio-text-to-text

models: [
    {
        description: "A popular multilingual model for automatic speech recognition.",
        id: "openai/whisper-base",
Contributor


This is ASR, not audio-text-to-text

@@ -119,7 +120,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
"audio-classification": ["speechbrain", "transformers", "transformers.js"],
"audio-to-audio": ["asteroid", "fairseq", "speechbrain"],
"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
"audio-text-to-text": [],
"audio-text-to-text": ["transformers", "speechbrain", "espnet", "nemo"],
Contributor


I think only transformers supports it
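If so, the registration would presumably shrink to just `"audio-text-to-text": ["transformers"],`.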


# Initialize the ASR pipeline
# Replace "openai/whisper-base" with any ASR model of your choice
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")
Contributor


For Automatic Speech Recognition you can use QwenAudio or Granite Speech, which are audio-text-to-text models.
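For illustration, transcription with an audio-text-to-text model such as Qwen2-Audio looks roughly like this. This is a sketch adapted from the usual chat-template pattern; the model id, the `audios` keyword, and the prompt wording are assumptions to verify against the model card:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"  # assumed id; check the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id)

# Pair an audio clip with a text instruction, chat-style.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "sample.wav"},
            {"type": "text", "text": "Transcribe this recording."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```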

# Initialize the speech-to-text translation pipeline
# Replace the model id below with a speech translation model of your choice
# For example, to translate English audio to French text:
translator_pipeline = pipeline("automatic-speech-recognition", model="facebook/s2t-small-en-fr-st")  # Example model
Contributor


Same for translation: Granite Speech supports translation En -> X.
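Following the transcription sketch above, translation would then be just a prompt change (again an assumption; the exact instruction format depends on the model):

```python
# Same processor/model as the transcription sketch; only the prompt changes.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "sample_en.wav"},
            {"type": "text", "text": "Translate this English speech into French."},
        ],
    }
]
# ...then apply_chat_template / processor / generate exactly as before.
```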
