Add 'audio-text-to-text' task to Hugging Face Tasks #1479

Open · wants to merge 4 commits into `main` (changes from 3 commits shown)
123 changes: 123 additions & 0 deletions packages/tasks/src/tasks/audio-text-to-text/about.md
@@ -0,0 +1,123 @@
## Audio Text to Text
**Review comment (Contributor):** Suggested change: remove the `## Audio Text to Text` heading (no need for this).

The Audio Text to Text task (sometimes referred to as speech-to-text, speech recognition, or speech translation, depending on the specifics) converts audio input into textual output. It is a versatile task with applications ranging from transcription to voice-driven interfaces.
**Review comment (Contributor):** this should be in the `data.ts` summary part.

### Use Cases
**Review comment (Contributor):** Suggested change: `### Use Cases` → `## Use Cases`.

* **Speech Recognition:** Transcribing spoken language from an audio clip into text. This is foundational for voice assistants, dictation software, and transcribing meetings or interviews.
**Review comment (Contributor):** these can be separate headers instead of bullet points.
* **Speech Translation:** Directly translating spoken language from an audio clip in one language into text in another language. This is useful for real-time translation applications or translating audio content.
* **Voice Command Interfaces:** Converting spoken commands into text that a system can then interpret to perform actions (e.g., "Play music," "Set a timer"); a toy sketch follows this list.
* **Audio Event Description/Captioning:** Generating textual descriptions of sounds or events occurring in an audio stream (though this might sometimes overlap with Audio Tagging).
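
The voice-command case can be made concrete with a toy dispatcher; the phrases and return strings below are hypothetical, and a real system would feed it the output of an ASR pipeline.

```python
# Toy dispatcher: route an ASR transcription to an action (hypothetical commands).
def handle_command(transcript: str) -> str:
    t = transcript.lower()
    if "play music" in t:
        return "starting music playback"
    if "set a timer" in t:
        return "timer set"
    return f"unrecognized command: {transcript!r}"

print(handle_command("Play music"))  # -> starting music playback
```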

### Python Examples
**Review comment (Contributor):** Suggested change: replace `### Python Examples` with `## Inference` and a **Transformers** subheading.
You can use the `transformers` library for many audio-text-to-text tasks.

**1. Automatic Speech Recognition (ASR):**

```python
from transformers import pipeline

# Initialize the ASR pipeline.
# Replace "openai/whisper-base" with any ASR model of your choice.
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# The pipeline accepts a local file path or a publicly accessible URL.
# For local files, make sure the format and sampling rate match the model's
# requirements; raw audio can also be passed as a NumPy array or Torch tensor,
# resampled (e.g., with torchaudio) to asr_pipeline.feature_extractor.sampling_rate.
text_output = asr_pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
print(text_output)
# Expected output: {'text': ' HELLO MY NAME IS NARSEEL'}
```

**Review comment (Contributor), on the pipeline initialization:** For Automatic Speech Recognition you can use QwenAudio or Granite Speech, which are audio-text-to-text models.

**Review comment (Contributor), on the inline comments:** if we have a lot of comments, best to turn into md here :)
*Note: For local audio files, you might need to load and preprocess them into the format expected by the pipeline (e.g., a NumPy array or Torch tensor of raw audio samples). Ensure the sampling rate matches the model's requirements.*
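
For models on the audio-text-to-text pipeline proper, which take an audio clip plus a text prompt, usage is chat-style. The sketch below follows the pattern shown on the Qwen2-Audio model card (the model family the reviewers point to); the prompt text and `max_new_tokens` value are illustrative, and argument names such as `audios=` can vary across `transformers` versions.

```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# Pair the audio clip with an open-ended text prompt (illustrative prompt).
audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": "Transcribe this audio, then describe the speaker's tone."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load(BytesIO(urlopen(audio_url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True)

output_ids = model.generate(**inputs, max_new_tokens=128)
output_ids = output_ids[:, inputs.input_ids.shape[1]:]  # keep only newly generated tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The open-ended prompt is what separates this task from plain ASR: the same model can transcribe, translate, or answer questions about the audio depending on the instruction.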

**2. Speech Translation (Example with a model fine-tuned for S2T):**
Speech-to-Text (S2T) models can directly translate audio from one language to text in another.

```python
from transformers import pipeline

# Initialize a speech translation pipeline.
# "facebook/s2t-small-en-fr-st" translates English speech to French text;
# replace it with any speech translation model of your choice.
translator_pipeline = pipeline("automatic-speech-recognition", model="facebook/s2t-small-en-fr-st")

# Pass a local path or a public URL, as with ASR; the audio must be in the
# source language the model expects (English here).
translated_text = translator_pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
print(translated_text)
# Expected output: {'text': 'BONJOUR MON NOM EST NARSEEL'} (if the input was "Hello my name is Narsil")
```

**Review comment (Contributor):** Same for translation: Granite Speech supports translation En → X.
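
Multilingual Whisper checkpoints can also produce English text from non-English speech through the same pipeline by passing `task="translate"` at call time; a minimal sketch, assuming you have a non-English audio file (the path below is a placeholder):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# task="translate" asks Whisper to emit English text regardless of the input language.
result = pipe("path/to/french_audio.wav", generate_kwargs={"task": "translate"})
print(result)  # e.g., {'text': 'ENGLISH TRANSLATION OF THE AUDIO...'}
```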

### JavaScript Example
**Review comment (Contributor):** Suggested change: replace the `### JavaScript Example` heading with **huggingface.js**.

You can use [`@huggingface/inference`](https://github.com/huggingface/huggingface.js) to perform audio-text-to-text tasks with models on the Hugging Face Hub.

```javascript
import { InferenceClient } from "@huggingface/inference";

const inference = new InferenceClient(HF_TOKEN); // Your Hugging Face token

async function transcribeAudio(audioBlob) {
  try {
    const result = await inference.automaticSpeechRecognition({
      model: "openai/whisper-base", // Or your preferred ASR/S2T model
      data: audioBlob,
    });
    console.log(result.text);
    return result.text;
  } catch (error) {
    console.error("Error during transcription:", error);
  }
}

// Example usage:
// Assumes you have an audio file as a Blob object (e.g., from a file input)
// const audioFile = new File(["...audio data..."], "audio.wav", { type: "audio/wav" });
// transcribeAudio(audioFile);

// Example fetching a remote audio file and then transcribing it:
async function transcribeRemoteAudio(audioUrl) {
  try {
    const response = await fetch(audioUrl);
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    const audioBlob = await response.blob();

    const result = await inference.automaticSpeechRecognition({
      model: "openai/whisper-base", // Or your preferred ASR/S2T model
      data: audioBlob,
    });
    console.log("Transcription:", result.text);
    return result.text;
  } catch (error) {
    console.error("Error during remote audio transcription:", error);
  }
}

// transcribeRemoteAudio("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac");
```
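
A Python counterpart using `InferenceClient` from `huggingface_hub`; a sketch, assuming a valid API token and that the model is served by the Inference API:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your Hugging Face token

# The audio argument accepts a local path, raw bytes, or a URL to an audio file.
output = client.automatic_speech_recognition(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac",
    model="openai/whisper-base",
)
print(output.text)
```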
**Review comment (Contributor):** You can add some sources here, like further reading.

65 changes: 65 additions & 0 deletions packages/tasks/src/tasks/audio-text-to-text/data.ts
@@ -0,0 +1,65 @@
import type { TaskDataCustom } from "../index.js";

const taskData: TaskDataCustom = {
  datasets: [
    {
      description: "A massively multilingual speech corpus, excellent for training speech recognition models.",
      id: "mozilla-foundation/common_voice_11_0", // Mozilla Common Voice (example)

**Review comment (Contributor):** Suggested change: drop the trailing comment, keeping just `id: "mozilla-foundation/common_voice_11_0",` (no need for comments).

    },
    {
      description: "A benchmark dataset for speech translation.",
      id: "facebook/covost2", // CoVoST 2 (example for speech translation)
    },
  ],
  demo: {
    inputs: [

**Review comment (Contributor):** with audio-text-to-text models I think you also need a text prompt, yes?

      {
        filename: "input.flac",

**Review comment (Contributor):** Feel free to open a PR with the input file here: https://huggingface.co/datasets/huggingfacejs/tasks

        type: "audio",
      },
    ],
    outputs: [
      {
        label: "Output", // Generic label, will be "Transcription" or "Translation"

**Review comment (Contributor):** Suggested change: keep just `label: "Output",` without the comment; you can actually write the label.

        content: "This is a sample transcription or translation from the audio.",
        type: "text",
      },
    ],
  },
  metrics: [
    {
      description: "Word Error Rate (WER) is a common metric for the accuracy of an automatic speech recognition system. The lower the WER, the better.",
      id: "wer",
    },
    {
      description: "BLEU (Bilingual Evaluation Understudy) score is often used to measure the quality of machine translation from one language to another.",
      id: "bleu",
    },
  ],
  models: [
    {
      description: "A popular multilingual model for automatic speech recognition.",
      id: "openai/whisper-base",

**Review comment (Contributor):** This is ASR, not audio-text-to-text.

    },
    {
      description: "A model for translating speech from English to German (example of a speech translation model).",
      id: "facebook/s2t-medium-en-de-st",

**Review comment (Contributor):** this is speech-to-text, not audio-text-to-text.

    },
  ],
  spaces: [
    {
      description: "A demonstration of the Whisper model for speech recognition.",
      id: "openai/whisper",

**Review comment (Contributor):** This is ASR, not audio-text-to-text.

    },
    {
      description: "An ESPnet demo that can perform speech recognition and translation.",
      id: "espnet/espnet_asr_demo",

**Review comment (Contributor):** This is ASR, not audio-text-to-text.

    },
  ],
  summary:
    "Audio Text to Text tasks convert audio input into textual output. This primarily includes automatic speech recognition (transcribing audio to text in the same language) and speech translation (translating audio in one language to text in another).",

**Review comment (Contributor):** these models can take an open-ended sound file + a text prompt and then output text, like voice chat; a good example is https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct

  widgetModels: ["openai/whisper-base"],

**Review comment (Contributor):** this is not an audio-text-to-text model; if it doesn't exist in widgets we can leave it empty.

  youtubeId: "SqE7xeyjBFg", // Example: A video about Whisper
};

export default taskData;
24 changes: 24 additions & 0 deletions packages/tasks/src/tasks/audio-text-to-text/spec/input.json
@@ -0,0 +1,24 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "AudioTextToTextInput",
  "type": "object",
  "properties": {
    "inputs": {
      "type": "string",
      "format": "binary",
      "description": "The audio input to be processed."
    },
    "parameters": {
      "type": "object",
      "properties": {
        "generate_kwargs": {
          "type": "object",
          "description": "Keyword arguments to control generation. Varies by model."
        }
      }
    }
  },
  "required": ["inputs"]
}
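
For illustration only, a request body matching this input schema might look like the sketch below; the base64 encoding of the binary audio and the `max_new_tokens` value are assumptions, not part of the spec. A matching response under the output schema that follows would be a list like `[{"text": "..."}]`.

```python
import base64
import json

# Hypothetical request body matching the AudioTextToTextInput schema above.
# "sample.flac" is a placeholder path; binary audio is base64-encoded here
# purely for illustration.
payload = {
    "inputs": base64.b64encode(open("sample.flac", "rb").read()).decode("utf-8"),
    "parameters": {
        "generate_kwargs": {"max_new_tokens": 128},  # assumed generation kwarg
    },
}
print(json.dumps(payload)[:100])
```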
17 changes: 17 additions & 0 deletions packages/tasks/src/tasks/audio-text-to-text/spec/output.json
@@ -0,0 +1,17 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "AudioTextToTextOutput",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "text": {
        "type": "string",
        "description": "The generated text from the audio input."
      }
    },
    "required": ["text"]
  }
}
5 changes: 3 additions & 2 deletions packages/tasks/src/tasks/index.ts
@@ -45,6 +45,7 @@ import imageTo3D from "./image-to-3d/data.js";
import textTo3D from "./text-to-3d/data.js";
import keypointDetection from "./keypoint-detection/data.js";
import videoTextToText from "./video-text-to-text/data.js";
import audioTextToText from "./audio-text-to-text/data.js";

export type * from "./audio-classification/inference.js";
export type * from "./automatic-speech-recognition/inference.js";
@@ -121,7 +122,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
"audio-classification": ["speechbrain", "transformers", "transformers.js"],
"audio-to-audio": ["asteroid", "fairseq", "speechbrain"],
"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
"audio-text-to-text": [],
"audio-text-to-text": ["transformers", "speechbrain", "espnet", "nemo"],
**Review comment (Contributor):** I think only transformers supports it.

"depth-estimation": ["transformers", "transformers.js"],
"document-question-answering": ["transformers", "transformers.js"],
"feature-extraction": ["sentence-transformers", "transformers", "transformers.js"],
@@ -205,7 +206,7 @@ export const TASKS_DATA: Record<PipelineType, TaskData | undefined> = {
"any-to-any": getData("any-to-any", anyToAny),
"audio-classification": getData("audio-classification", audioClassification),
"audio-to-audio": getData("audio-to-audio", audioToAudio),
"audio-text-to-text": getData("audio-text-to-text", placeholder),
"audio-text-to-text": getData("audio-text-to-text", audioTextToText),
"automatic-speech-recognition": getData("automatic-speech-recognition", automaticSpeechRecognition),
"depth-estimation": getData("depth-estimation", depthEstimation),
"document-question-answering": getData("document-question-answering", documentQuestionAnswering),