Thanks for your great work! I'm confused about the training data described in the main paper.
In the first paragraph, you say you use the same data as LLaVA. Does this mean that this model (using Vicuna as the LLM) only supports image understanding, not video?
In the second paragraph, you use video data. Is it used only with LLaMA-3.1-8B, or also with Vicuna-7B? Does this mean that all the video evaluation experiments in your paper were conducted with LLaMA-3.1-8B?
Could you release more training details and code?
lily410