Thanks for your great work! I'm confused about the training data described in the main paper.
In the first paragraph, you say you use the same data as LLaVA. Does this mean that this model (using Vicuna as the LLM) only supports image understanding, not video?
In the second paragraph, you use video data. Is it used only with LLaMA-3.1-8B, or also with Vicuna-7B? Does this mean that all the video evaluation experiments in your paper were conducted with LLaMA-3.1-8B?
Could you release more training details and code?
lily410