MultimodalQnA/README.md (7 additions, 5 deletions)
@@ -2,7 +2,7 @@
Suppose you have a collection of videos and want to perform question answering to extract insights from them. Answering such questions typically requires understanding visual cues in the videos, knowledge derived from the audio content, or often a mix of both. The MultimodalQnA framework offers an optimal solution for this purpose.
- `MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+ `MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
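To give a rough idea of what embedding visual and textual data into a unified semantic space can look like, below is a minimal sketch using the Hugging Face `transformers` BridgeTower classes. It is illustrative only, not the MultimodalQnA embedding microservice code; the example image URL and caption are placeholders, and the exact preprocessing and embedding extraction used by the service may differ.

```python
# Illustrative sketch: produce a joint image/text embedding with BridgeTower.
# Assumes the linked checkpoint ships standard BridgeTower processor configs.
import requests
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerModel

model_name = "BridgeTower/bridgetower-large-itm-mlm-gaudi"
processor = BridgeTowerProcessor.from_pretrained(model_name)
model = BridgeTowerModel.from_pretrained(model_name)

# A video frame (here: any example image) paired with its transcript or caption.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats sleeping on a pink couch"

inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)

# Pooled cross-modal representation; embeddings like this are stored in the
# vector database and retrieved at question time as context for the LVM.
embedding = outputs.pooler_output
print(embedding.shape)
```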
The MultimodalQnA architecture is shown below:
@@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component
By default, the embedding and LVM models are set to the default values listed below:
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
- Also, you are able to get the list of all videos that you uploaded:
+ Also, you are able to get the list of all files that you uploaded:
```bash
curl -X POST \
-H "Content-Type: application/json" \
-   ${DATAPREP_GET_VIDEO_ENDPOINT}
+   ${DATAPREP_GET_FILE_ENDPOINT}
```
- Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
+ Then you will get a response like the Python-style list below. Notice that the name of each uploaded file, e.g. `videoname.mp4`, will become `videoname_uuid.mp4`, where `uuid` is a unique ID for each uploaded file. The same file uploaded twice will get a different `uuid` each time.
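As a purely illustrative example (the filenames and UUIDs below are hypothetical, not real output), the response might look like:

```
["videoname_7ad5e0a6-8d22-4f5d-9f1e-1b2c3d4e5f60.mp4", "imagename_0a1b2c3d-4e5f-6789-abcd-ef0123456789.png"]
```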
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
252
271
```
253
272
254
-
Also, you are able to get the list of all videos that you uploaded:
273
+
Also, you are able to get the list of all files that you uploaded:
255
274
256
275
```bash
257
276
curl -X POST \
258
277
-H "Content-Type: application/json" \
259
-
${DATAPREP_GET_VIDEO_ENDPOINT}
278
+
${DATAPREP_GET_FILE_ENDPOINT}
260
279
```
261
280
262
-
Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
281
+
Then you will get the response python-style LIST like this. Notice the name of each uploaded file e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded file. The same files that are uploaded twice will have different `uuid`.