Add 'audio-text-to-text' task to Hugging Face Tasks #1479
@@ -0,0 +1,123 @@
## Audio Text to Text

The Audio Text to Text task (sometimes referred to as speech-to-text, speech recognition, or speech translation, depending on the specifics) converts audio input into textual output. It is a versatile task with applications ranging from transcription to speech translation.

Review comment: this should be in the data.ts summary part
### Use Cases
* **Speech Recognition:** Transcribing spoken language from an audio clip into text. This is foundational for voice assistants, dictation software, and transcribing meetings or interviews.
* **Speech Translation:** Directly translating spoken language from an audio clip in one language into text in another language. This is useful for real-time translation applications or for translating audio content.
* **Voice Command Interfaces:** Converting spoken commands into text that can then be interpreted by a system to perform actions (e.g., "Play music," "Set a timer").
* **Audio Event Description/Captioning:** Generating textual descriptions of sounds or events occurring in an audio stream (though this can overlap with Audio Tagging).

Review comment: these can be separate headers instead of bullet points
### Python Examples

You can use the `transformers` library for many audio-text-to-text tasks.
**1. Automatic Speech Recognition (ASR):**

```python
from transformers import pipeline

# Initialize the ASR pipeline.
# Replace "openai/whisper-base" with any ASR model of your choice.
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# Transcribe a publicly accessible audio file (the pipeline fetches and decodes it).
text_output = asr_pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
print(text_output)
# Expected output: {'text': ' HELLO MY NAME IS NARSEEL'}

# For a local file, pass a path (or a NumPy array / torch tensor of raw samples):
# text_output = asr_pipeline("path/to/your/sample_audio.flac")
# If your file's sampling rate differs from the model's, resample first, e.g. with torchaudio:
# waveform, sample_rate = torchaudio.load("your_audio_file.wav")
# if sample_rate != asr_pipeline.feature_extractor.sampling_rate:
#     resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=asr_pipeline.feature_extractor.sampling_rate)
#     waveform = resampler(waveform)
```

Review comment: For Automatic Speech Recognition you can use QwenAudio or Granite Speech, which are audio-text-to-text models.

Review comment: if we have a lot of comments, best to turn into md here :)

*Note: For local audio files, you might need to load and preprocess them into the format expected by the pipeline (e.g., a NumPy array or torch tensor of raw audio samples). Ensure the sampling rate matches the model's requirements.*
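Following the review suggestion above, here is a minimal sketch of using an audio-text-to-text model that accepts an audio clip together with a text prompt. It assumes the `Qwen/Qwen2-Audio-7B-Instruct` checkpoint, a recent `transformers` release with `Qwen2AudioForConditionalGeneration`, and `librosa` for decoding; exact argument names (e.g., `audios=`) may differ between versions, so treat this as a starting point rather than a definitive recipe.

```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# One user turn containing both an audio clip and a text prompt.
audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": "Transcribe the speech in this clip."},
        ],
    }
]

# Build the chat prompt and load the audio at the sampling rate the model expects.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load(BytesIO(urlopen(audio_url).read()), sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = generated_ids[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```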

**2. Speech Translation (example with a model fine-tuned for S2T):**

Speech-to-Text (S2T) models can directly translate audio from one language to text in another.

```python
from transformers import pipeline

# Initialize the speech translation pipeline.
# Replace the model ID with the speech translation model of your choice;
# this example model translates English audio into French text.
translator_pipeline = pipeline("automatic-speech-recognition", model="facebook/s2t-small-en-fr-st")

# Process an audio file (same input formats as for ASR):
# audio_input = "path/to/your/english_audio.wav"
# translated_text = translator_pipeline(audio_input)
# print(translated_text)
# Expected output: {'text': 'FRENCH TRANSLATION OF THE AUDIO...'}

# Example with a publicly accessible URL (the audio must be in the source language the model expects):
# translated_text_url = translator_pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
# print(translated_text_url)
# Expected output: {'text': 'BONJOUR MON NOM EST NARSEEL'} (if the input was "Hello my name is Narsil")
```

Review comment: Same for translation, Granite Speech supports translation En -> X.
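As the review comment notes, general audio-text-to-text models can also perform translation when instructed through the text prompt. Reusing the `processor`, `model`, `audio`, and `audio_url` from the Qwen2-Audio sketch above (the prompt wording and target language are illustrative assumptions), the only change is the conversation:

```python
# Ask the audio-text-to-text model to translate the clip instead of transcribing it.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": "Translate the speech in this clip into French."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```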

### JavaScript Example

You can use [`@huggingface/inference`](https://github.com/huggingface/huggingface.js) to perform audio-text-to-text tasks with models on the Hugging Face Hub.
```javascript
import { InferenceClient } from "@huggingface/inference";

const inference = new InferenceClient(HF_TOKEN); // Your Hugging Face token

async function transcribeAudio(audioBlob) {
  try {
    const result = await inference.automaticSpeechRecognition({
      model: "openai/whisper-base", // Or your preferred ASR/S2T model
      data: audioBlob,
    });
    console.log(result.text);
    return result.text;
  } catch (error) {
    console.error("Error during transcription:", error);
  }
}

// Example usage:
// Assumes you have an audio file as a Blob object (e.g., from a file input)
// const audioFile = new File(["...audio data..."], "audio.wav", { type: "audio/wav" });
// transcribeAudio(audioFile);

// Example fetching a remote audio file and then transcribing it:
async function transcribeRemoteAudio(audioUrl) {
  try {
    const response = await fetch(audioUrl);
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    const audioBlob = await response.blob();

    const result = await inference.automaticSpeechRecognition({
      model: "openai/whisper-base", // Or your preferred ASR/S2T model
      data: audioBlob,
    });
    console.log("Transcription:", result.text);
    return result.text;
  } catch (error) {
    console.error("Error during remote audio transcription:", error);
  }
}

// transcribeRemoteAudio("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac");
```

Review comment: You can add some sources here, like further reading.

ProCreations-Official marked this conversation as resolved.
@@ -0,0 +1,65 @@
import type { TaskDataCustom } from "../index.js";

const taskData: TaskDataCustom = {
	datasets: [
		{
			description: "A massively multilingual speech corpus, excellent for training speech recognition models.",
			id: "mozilla-foundation/common_voice_11_0", // Mozilla Common Voice (example)

Review comment: no need for comments

		},
		{
			description: "A benchmark dataset for speech translation.",
			id: "facebook/covost2", // CoVoST 2 (example for speech translation)
		},
	],
	demo: {
		inputs: [

Review comment: with audio-text-to-text models I think you also need a text prompt, yes?

			{
				filename: "input.flac",

Review comment: Feel free to open a PR here with the input file: https://huggingface.co/datasets/huggingfacejs/tasks

				type: "audio",
			},
		],
		outputs: [
			{
				label: "Output", // Generic label, will be "Transcription" or "Translation"

Review comment: You can actually write the label.

				content: "This is a sample transcription or translation from the audio.",
				type: "text",
			},
		],
	},
	metrics: [
		{
			description: "Word Error Rate (WER) is a common metric for the accuracy of an automatic speech recognition system. The lower the WER, the better.",
			id: "wer",
		},
		{
			description: "BLEU (Bilingual Evaluation Understudy) score is often used to measure the quality of machine translation from one language to another.",
			id: "bleu",
		},
	],
	models: [
		{
			description: "A popular multilingual model for automatic speech recognition.",
			id: "openai/whisper-base",

Review comment: This is ASR, not audio-text-to-text.

		},
		{
			description: "A model for translating speech from English to German (example of a speech translation model).",
			id: "facebook/s2t-medium-en-de-st",

Review comment: this is speech-to-text, not audio-text-to-text

		},
	],
	spaces: [
		{
			description: "A demonstration of the Whisper model for speech recognition.",
			id: "openai/whisper",

Review comment: This is ASR, not audio-text-to-text.

		},
		{
			description: "An ESPnet demo that can perform speech recognition and translation.",
			id: "espnet/espnet_asr_demo",

Review comment: This is ASR, not audio-text-to-text.

		},
	],
	summary:
		"Audio Text to Text tasks convert audio input into textual output. This primarily includes automatic speech recognition (transcribing audio to text in the same language) and speech translation (translating audio in one language to text in another).",

Review comment: these models can take an open-ended sound file + text prompt and then output text, like voice chat.

	widgetModels: ["openai/whisper-base"],

Review comment: this is not an audio-text-to-text model; if it doesn't exist in widgets we can leave it empty.

	youtubeId: "SqE7xeyjBFg", // Example: A video about Whisper
};

export default taskData;
@@ -0,0 +1,24 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "AudioTextToTextInput",
  "type": "object",
  "properties": {
    "inputs": {
      "type": "string",
      "format": "binary",
      "description": "The audio input to be processed."
    },
    "parameters": {
      "type": "object",
      "properties": {
        "generate_kwargs": {
          "type": "object",
          "description": "Keyword arguments to control generation. Varies by model."
        }
      }
    }
  },
  "required": [
    "inputs"
  ]
}
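For illustration, a request body conforming to this input schema might look like the following Python sketch; the file name and the specific generation kwarg are hypothetical, and only the binary `inputs` field is required:

```python
# Hypothetical payload matching the AudioTextToTextInput schema.
payload = {
    "inputs": open("sample_audio.flac", "rb").read(),  # required: binary audio data
    "parameters": {
        "generate_kwargs": {"max_new_tokens": 256},  # optional, model-specific generation settings
    },
}
```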
@@ -0,0 +1,17 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "AudioTextToTextOutput",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "text": {
        "type": "string",
        "description": "The generated text from the audio input."
      }
    },
    "required": [
      "text"
    ]
  }
}
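A conforming response is an array of objects, each with a required `text` field; for example (the value is illustrative):

```python
# Hypothetical response matching the AudioTextToTextOutput schema.
response = [{"text": "Hello, my name is Narsil."}]
print(response[0]["text"])
```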
@@ -45,6 +45,7 @@ import imageTo3D from "./image-to-3d/data.js";
import textTo3D from "./text-to-3d/data.js";
import keypointDetection from "./keypoint-detection/data.js";
import videoTextToText from "./video-text-to-text/data.js";
+import audioTextToText from "./audio-text-to-text/data.js";

export type * from "./audio-classification/inference.js";
export type * from "./automatic-speech-recognition/inference.js";
@@ -121,7 +122,7 @@ export const TASKS_MODEL_LIBRARIES: Record<PipelineType, ModelLibraryKey[]> = {
"audio-classification": ["speechbrain", "transformers", "transformers.js"],
"audio-to-audio": ["asteroid", "fairseq", "speechbrain"],
"automatic-speech-recognition": ["espnet", "nemo", "speechbrain", "transformers", "transformers.js"],
-"audio-text-to-text": [],
+"audio-text-to-text": ["transformers", "speechbrain", "espnet", "nemo"],

Review comment: I think only transformers supports it.

"depth-estimation": ["transformers", "transformers.js"],
"document-question-answering": ["transformers", "transformers.js"],
"feature-extraction": ["sentence-transformers", "transformers", "transformers.js"],
@@ -205,7 +206,7 @@ export const TASKS_DATA: Record<PipelineType, TaskData | undefined> = {
"any-to-any": getData("any-to-any", anyToAny),
"audio-classification": getData("audio-classification", audioClassification),
"audio-to-audio": getData("audio-to-audio", audioToAudio),
-"audio-text-to-text": getData("audio-text-to-text", placeholder),
+"audio-text-to-text": getData("audio-text-to-text", audioTextToText),
"automatic-speech-recognition": getData("automatic-speech-recognition", automaticSpeechRecognition),
"depth-estimation": getData("depth-estimation", depthEstimation),
"document-question-answering": getData("document-question-answering", documentQuestionAnswering),

Review comment: no need for this