
Commit dcd4076

Update internvl_g(enerative) & support lora
1 parent 4600a32 commit dcd4076

21 files changed, +590 -347 lines changed

internvl_g/README.md

Lines changed: 165 additions & 37 deletions
@@ -1,16 +1,18 @@
-# InternVL Stage-2 Pre-training
+# InternVL Stage-2 Pre-training & Retrieval Fine-tuning
 
-This folder contains the implementation of the InternVL for stage2 pre-training and retrieval fine-tuning.
+This folder contains the implementation of the InternVL 1.0 for stage2 pre-training and retrieval fine-tuning, which corresponds to Section 4.3 of our [InternVL 1.0 paper](https://arxiv.org/pdf/2312.14238).
+
+![image](https://github.com/user-attachments/assets/239f38b2-8867-4539-9dd8-c1a1eaa40aef)
 
 ## 🛠️ Installation
 
-See [INSTALLATION.md](../INSTALLATION.md)
+Follow the [installation guide](../INSTALLATION.md) to perform installations.
 
 ## 📦 Data Preparation
 
 Three datasets need to be prepared: COCO Caption, Flickr30K, and NoCaps.
 
-<details>
+<details open>
 <summary>COCO Caption</summary>
 
 ```bash
@@ -31,7 +33,7 @@ cd ../../../
 
 </details>
 
-<details>
+<details open>
 <summary>Flickr30K</summary>
 
 ```bash
@@ -54,7 +56,7 @@ cd ../..
 
 </details>
 
-<details>
+<details open>
 <summary>NoCaps</summary>
 
 ```bash
@@ -69,6 +71,8 @@ cd ../..
 
 </details>
 
+After the download is complete, the directory structure is:
+
 ```shell
 data
 ├── coco
@@ -103,39 +107,19 @@ Please download the above model weights and place them in the `pretrained/` fold
 ```sh
 cd pretrained/
 # pip install -U huggingface_hub
-huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL-14B-224px --local-dir internvl_14b_224px
+huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL-14B-224px --local-dir InternVL-14B-224px
 ```
 
 The directory structure is:
 
 ```sh
 pretrained
-└── internvl_14b_224px/
-```
-
-## 🔥 Pre-training
-
-Coming Soon
-
-## 🔥 Retrieval Fine-tuning
-
-To fine-tune InternVL on Flickr30K with 32 GPUs and slurm system, run:
-
-```bash
-GPUS=32 sh shell/finetune/internvl_stage2_finetune_flickr_364_bs1024_ep10.sh
-```
-
-To fine-tune InternVL on Flickr30K-CN with 32 GPUs and slurm system, run:
-
-```shell
-GPUS=32 sh shell/finetune/internvl_stage2_finetune_flickrcn_364_bs1024_ep10.sh
+└── InternVL-14B-224px/
 ```
 
-To fine-tune InternVL on COCO with 32 GPUs and slurm system, run:
+## 🔥 Generative Pre-training
 
-```shell
-GPUS=32 sh shell/finetune/internvl_stage2_finetune_coco_364_bs1024_ep5.sh
-```
+There are currently no plans to release this part of the code.
 
 ## 📊 Evaluation
 
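For reference, the same checkpoint can also be pulled programmatically with `huggingface_hub.snapshot_download`. This is a minimal sketch, assuming `huggingface_hub` is installed; it is not part of this commit:

```python
# Sketch: download OpenGVLab/InternVL-14B-224px into the renamed
# pretrained/InternVL-14B-224px directory via the huggingface_hub API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='OpenGVLab/InternVL-14B-224px',
    local_dir='pretrained/InternVL-14B-224px',  # matches the new --local-dir value
)
```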
@@ -151,7 +135,7 @@ GPUS=32 sh shell/finetune/internvl_stage2_finetune_coco_364_bs1024_ep5.sh
 <summary>[InternVL-G] COCO Karpathy test</summary>
 
 ```bash
-sh evaluate.sh pretrained/internvl_14b_224px caption-coco
+sh evaluate.sh pretrained/InternVL-14B-224px caption-coco
 ```
 
 Expected results:
@@ -166,7 +150,7 @@ Expected results:
 <summary>[InternVL-G] Flickr30K Karpathy test</summary>
 
 ```
-sh evaluate.sh pretrained/internvl_14b_224px caption-flickr30k
+sh evaluate.sh pretrained/InternVL-14B-224px caption-flickr30k
 ```
 
 Expected results:
@@ -181,7 +165,7 @@ Expected results:
 <summary>[InternVL-G] NoCaps val</summary>
 
 ```bash
-sh evaluate.sh pretrained/internvl_14b_224px caption-nocaps
+sh evaluate.sh pretrained/InternVL-14B-224px caption-nocaps
 ```
 
 Expected results:
@@ -197,13 +181,13 @@ Expected results:
 #### Flickr30K fine-tuned model: [InternVL-14B-Flickr30K-FT-364px](https://huggingface.co/OpenGVLab/InternVL-14B-Flickr30K-FT-364px)
 
 <table>
-<tr align=center>
+<tr align=center>
 <td rowspan="3" align=center><b>model</b></td>
 <td colspan="6" align=center><b>Flickr30K</b></td>
 <td rowspan="3" align=center><b>avg</b></td>
 
 </tr>
-<tr align=center>
+<tr align=center>
 <td colspan="3" align=center><b>image-to-text</b></td>
 <td colspan="3" align=center><b>text-to-image</b></td>
 </tr>
@@ -284,13 +268,13 @@ Expected results:
 #### Flickr30K-CN fine-tuned model: [InternVL-14B-FlickrCN-FT-364px](https://huggingface.co/OpenGVLab/InternVL-14B-FlickrCN-FT-364px)
 
 <table>
-<tr align=center>
+<tr align=center>
 <td rowspan="3" align=center><b>model</b></td>
 <td colspan="6" align=center><b>Flickr30K-CN</b></td>
 <td rowspan="3" align=center><b>avg</b></td>
 
 </tr>
-<tr align=center>
+<tr align=center>
 <td colspan="3" align=center><b>image-to-text</b></td>
 <td colspan="3" align=center><b>text-to-image</b></td>
 </tr>
@@ -367,3 +351,147 @@ Expected results:
 ```
 
 </details>
+
+## 🔥 Retrieval Fine-tuning (Fully)
+
+> Note: In our experiments, full parameter fine-tuning achieves the best results on image-text retrieval tasks in Flickr30K and COCO. By following the experimental hyperparameters in this section, you can reproduce the model performance reported in the [Evaluation section](#evaluation).
+
+To fine-tune InternVL on Flickr30K with 32 GPUs and slurm system, run:
+
+```bash
+PARTITION='your partition' GPUS=32 sh shell/finetune/internvl_stage2_finetune_flickr_364_bs1024_ep10.sh
+```
+
+To fine-tune InternVL on Flickr30K-CN with 32 GPUs and slurm system, run:
+
+```shell
+PARTITION='your partition' GPUS=32 sh shell/finetune/internvl_stage2_finetune_flickrcn_364_bs1024_ep10.sh
+```
+
+To fine-tune InternVL on COCO with 32 GPUs and slurm system, run:
+
+```shell
+PARTITION='your partition' GPUS=32 sh shell/finetune/internvl_stage2_finetune_coco_364_bs1024_ep5.sh
+```
+
+The hyperparameters used here are:
+
+| config | Flickr30K | Flickr30K-CN | COCO |
+| --------------------------- | ----------------------------------- | ----------------------------------- | ----------------------------------- |
+| learning rate | 1e-6 | 1e-6 | 1e-6 |
+| layer-wise lr<br>decay rate | InternViT-6B (0.9),<br>QLLaMA (0.9) | InternViT-6B (0.9),<br>QLLaMA (0.9) | InternViT-6B (0.9),<br>QLLaMA (0.9) |
+| optimizer | AdamW | AdamW | AdamW |
+| weight decay | 0.05 | 0.05 | 0.05 |
+| input resolution | 364x364 | 364x364 | 364x364 |
+| total batch size | 1024 | 1024 | 1024 |
+| warm-up iterations | 100 | 100 | 100 |
+| training epochs | 10 | 10 | 5 |
+| drop path rate | 0.3 | 0.3 | 0.3 |
+| numerical precision | zero1 + bf16 | zero1 + bf16 | zero1 + bf16 |
+| trainable / total params | 14B / 14B | 14B / 14B | 14B / 14B |
+| GPUs for training | 32×A100 (80G) | 32×A100 (80G) | 32×A100 (80G) |
+| Required GPU memory | 80G | 80G | 80G |
+
## 🔥 Retrieval Fine-tuning (Head)
396+
397+
> Note: This section demonstrates how to perform a cost-effective fine-tuning of our model. The hyperparameters shown here are not optimized for any specific task. For practical applications, further adjustments to the hyperparameters may be necessary to achieve optimal performance.
398+
399+
To fine-tune the head of InternVL on Flickr30K with 4 GPUs, run:
400+
401+
```bash
402+
GPUS=4 BATCH_SIZE=32 sh shell/head_finetune/internvl_stage2_finetune_flickr_224_bs1024_ep10_head_4gpu.sh
403+
```
404+
405+
To fine-tune the head of InternVL on Flickr30K-CN with 4 GPUs, run:
406+
407+
```shell
408+
GPUS=4 BATCH_SIZE=32 sh shell/head_finetune/internvl_stage2_finetune_flickrcn_224_bs1024_ep10_head_4gpu.sh
409+
```
410+
411+
To fine-tune the head of InternVL on COCO with 4 GPUs, run:
412+
413+
```shell
414+
GPUS=4 BATCH_SIZE=32 shell/head_finetune/internvl_stage2_finetune_coco_224_bs1024_ep5_head_4gpu.sh
415+
```
416+
417+
The hyperparameters used here are:
418+
419+
| config | Flickr30K | Flickr30K-CN | COCO |
420+
| ------------------------ | ------------- | ------------- | ------------- |
421+
| learning rate | 1e-6 | 1e-6 | 1e-6 |
422+
| optimizer | AdamW | AdamW | AdamW |
423+
| weight decay | 0.05 | 0.05 | 0.05 |
424+
| input resolution | 224x224 | 224x224 | 224x224 |
425+
| total batch size | 4x32 | 4x32 | 4x32 |
426+
| warm-up iterations | 100 | 100 | 100 |
427+
| training epochs | 10 | 10 | 5 |
428+
| drop path rate | 0.0 | 0.0 | 0.3 |
429+
| numerical precision | zero3 + bf16 | zero3 + bf16 | zero1 + bf16 |
430+
| trainable / total params | 0.2B / 14B | 0.2B / 14B | 0.2B / 14B |
431+
| GPUs for training | 4×GPU (>=32G) | 4×GPU (>=32G) | 4×GPU (>=32G) |
432+
| Required GPU memory | 24G | 24G | 24G |
433+
434+
## 🔥 Retrieval Fine-tuning (LoRA)
435+
436+
> Note: This section demonstrates how to perform a cost-effective fine-tuning of our model. The hyperparameters shown here are not optimized for any specific task. For practical applications, further adjustments to the hyperparameters may be necessary to achieve optimal performance.
437+
438+
To fine-tune InternVL using LoRA on Flickr30K with 4 GPUs, run:
439+
440+
```bash
441+
GPUS=4 BATCH_SIZE=32 sh shell/lora_finetune/internvl_stage2_finetune_flickr_224_bs1024_ep10_lora16_4gpu.sh
442+
```
443+
444+
To fine-tune InternVL using LoRA on Flickr30K-CN with 4 GPUs, run:
445+
446+
```shell
447+
GPUS=4 BATCH_SIZE=32 sh shell/lora_finetune/internvl_stage2_finetune_flickrcn_224_bs1024_ep10_lora16_4gpu.sh
448+
```
449+
450+
To fine-tune InternVL using LoRA on COCO with 4 GPUs, run:
451+
452+
```shell
453+
GPUS=4 BATCH_SIZE=32 shell/lora_finetune/internvl_stage2_finetune_coco_224_bs1024_ep5_lora16_4gpu.sh
454+
```
455+
456+
The hyperparameters used here are:
457+
458+
| config | Flickr30K | Flickr30K-CN | COCO |
459+
| ------------------------ | ------------- | ------------- | ------------- |
460+
| learning rate | 1e-6 | 1e-6 | 1e-6 |
461+
| optimizer | AdamW | AdamW | AdamW |
462+
| lora rank | 16 | 16 | 16 |
463+
| weight decay | 0.05 | 0.05 | 0.05 |
464+
| input resolution | 224x224 | 224x224 | 224x224 |
465+
| total batch size | 4x32 | 4x32 | 4x32 |
466+
| warm-up iterations | 100 | 100 | 100 |
467+
| training epochs | 10 | 10 | 5 |
468+
| drop path rate | 0.0 | 0.0 | 0.3 |
469+
| numerical precision | zero3 + bf16 | zero3 + bf16 | zero1 + bf16 |
470+
| trainable / total params | 0.3B / 14B | 0.3B / 14B | 0.3B / 14B |
471+
| GPUs for training | 4×GPU (>=40G) | 4×GPU (>=40G) | 4×GPU (>=40G) |
472+
| Required GPU memory | 37G | 37G | 37G |
473+
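For orientation, injecting rank-16 LoRA adapters with the `peft` library looks roughly like the sketch below. It uses a toy module with illustrative `target_modules` names and follows the `lora_alpha = 2 * r` convention introduced in this commit; it is not the repository's `wrap_backbone_lora` implementation:

```python
# Sketch: wrap linear layers with rank-16 LoRA adapters via peft.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class ToyBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return self.proj(q + k + v)

model = nn.Sequential(ToyBlock(), ToyBlock())
r = 16  # same rank as the lora16 scripts above
lora_config = LoraConfig(r=r, lora_alpha=2 * r, lora_dropout=0.05,
                         target_modules=['qkv', 'proj'])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters require grad
```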
+## Fine-Tuning a Custom Dataset
+
+1. **Organize Your Data**: Format your dataset similar to COCO or Flickr30K.
+
+2. **Update Meta Information**: Add your dataset's meta information to the `ds_collections` dictionary in `internvl_g/internvl/train/internvl_stage2_finetune.py`. For example:
+
+```python
+ds_collections = {
+    'my_dataset_flickr_format': {
+        'root': './data/my_dataset/images/',
+        'annotation': './data/my_dataset/annotations.txt',
+    },
+    'my_dataset_coco_format': {
+        'root': './data/my_dataset/',
+        'annotation': './data/my_dataset/annotations.json',
+    },
+}
+```
+
+3. **Name Your Dataset**:
+
+   - Include `flickr_format` or `coco_format` in your dataset's `dataset_name`. This will allow the script to reuse the Flickr30K or COCO dataloader accordingly.
+
+By following these steps, you can easily fine-tune the InternVL model on your custom dataset using the existing COCO or Flickr30K data loading mechanisms.
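The naming rule in step 3 reduces to a substring check on the dataset key; a small illustrative sketch (not the training script's actual code):

```python
# Sketch: route a ds_collections key to the Flickr30K- or COCO-style loader
# based on the naming convention described above.
def pick_loader(dataset_name: str) -> str:
    if 'flickr_format' in dataset_name:
        return 'flickr'  # reuse the Flickr30K dataloader
    if 'coco_format' in dataset_name:
        return 'coco'    # reuse the COCO dataloader
    raise ValueError(f"dataset_name must contain 'flickr_format' or 'coco_format': {dataset_name}")

assert pick_loader('my_dataset_flickr_format') == 'flickr'
assert pick_loader('my_dataset_coco_format') == 'coco'
```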

internvl_g/eval/evaluate_caption.py

Lines changed: 1 addition & 2 deletions
@@ -220,8 +220,7 @@ def evaluate_qllama_model():
     parser.add_argument('--seed', type=int, default=0)
    args = parser.parse_args()
 
-    if not os.path.exists(args.out_dir):
-        os.makedirs(args.out_dir)
+    os.makedirs(args.out_dir, exist_ok=True)
 
    args.datasets = args.datasets.split(',')
    print('datasets:', args.datasets)

internvl_g/internvl/model/internvl_stage2/modeling_internvl.py

Lines changed: 19 additions & 19 deletions
@@ -48,23 +48,23 @@ class InternVLPreTrainedModel(PreTrainedModel):
     _skip_keys_device_placement = 'past_key_values'
     _keep_in_fp32_modules = ['wo']
 
-    def _init_weights(self, module):
-        """Initialize the weights"""
-        factor = self.config.initializer_range
-        if isinstance(module, nn.Conv2d) or isinstance(module, nn.Embedding) or isinstance(module, nn.Linear):
-            module.weight.data.normal_(mean=0.0, std=factor)
-            if hasattr(module, 'bias') and module.bias is not None:
-                module.bias.data.zero_()
-        if isinstance(module, InternVisionEmbeddings):
-            if hasattr(self.config, 'vision_config'):
-                factor = self.config.vision_config.initializer_range
-            nn.init.trunc_normal_(module.position_embedding, mean=0.0, std=factor)
-            nn.init.trunc_normal_(module.class_embedding, mean=0.0, std=factor)
-        elif isinstance(module, nn.LayerNorm):
-            module.bias.data.zero_()
-            module.weight.data.fill_(1.0)
-        elif isinstance(module, nn.Linear) and module.bias is not None:
-            module.bias.data.zero_()
+    # def _init_weights(self, module):
+    #     """Initialize the weights"""
+    #     factor = self.config.initializer_range
+    #     if isinstance(module, nn.Conv2d) or isinstance(module, nn.Embedding) or isinstance(module, nn.Linear):
+    #         module.weight.data.normal_(mean=0.0, std=factor)
+    #         if hasattr(module, 'bias') and module.bias is not None:
+    #             module.bias.data.zero_()
+    #     if isinstance(module, InternVisionEmbeddings):
+    #         if hasattr(self.config, 'vision_config'):
+    #             factor = self.config.vision_config.initializer_range
+    #         nn.init.trunc_normal_(module.position_embedding, mean=0.0, std=factor)
+    #         nn.init.trunc_normal_(module.class_embedding, mean=0.0, std=factor)
+    #     elif isinstance(module, nn.LayerNorm):
+    #         module.bias.data.zero_()
+    #         module.weight.data.fill_(1.0)
+    #     elif isinstance(module, nn.Linear) and module.bias is not None:
+    #         module.bias.data.zero_()
 
     def _set_gradient_checkpointing(self, module, value=False):
         if isinstance(module, InternVisionModel):
@@ -248,9 +248,9 @@ def __init__(self, config: InternVLConfig):
         # self.post_init()
 
         if config.use_backbone_lora:
-            self.wrap_backbone_lora(r=config.use_backbone_lora)
+            self.wrap_backbone_lora(r=config.use_backbone_lora, lora_alpha=config.use_backbone_lora * 2)
         if config.use_qllama_lora:
-            self.wrap_qllama_lora(r=config.use_qllama_lora)
+            self.wrap_qllama_lora(r=config.use_qllama_lora, lora_alpha=config.use_qllama_lora * 2)
         if config.force_image_size:
             self.vision_model.resize_pos_embeddings(
                 old_size=config.vision_config.image_size,
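In the standard LoRA formulation the low-rank update is scaled by `lora_alpha / r`, so passing `lora_alpha = 2 * r` as above keeps the effective scale at 2.0 regardless of rank. A self-contained sketch of that formulation (not InternVL's wrapping code):

```python
# Sketch: a standard LoRA linear layer; scaling = lora_alpha / r = 2.0
# when lora_alpha is tied to 2 * r as in this commit.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base = base
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scaling = lora_alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(64, 64), r=16, lora_alpha=32)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```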
