Finetune Problem: Various Fine-tuning Questions #270
Replies: 42 comments 81 replies
-
Hello! I'm glad to have discovered CogVLM. I have the following questions about fine-tuning:
-
Hi, I'm fine-tuning by following the example.
My own arguments are (_name_or_path='/mntnlp/common_base_model/cogvqa/cogagent', architectures=['CogAgentForCausalLM'], attention_dropout=0.1, auto_map={'AutoConfig': 'configuration_cogagent.CogAgentConfig', 'AutoModelForCausalLM': 'modeling_cogagent.CogAgentForCausalLM'}, batch_from_same_dataset=False, batch_size=4, bf16=False, block_size=10000, bos_token_id=1, checkpoint_activations=False, checkpoint_num_layers=1, checkpoint_skip_layers=0, cross_compute_hidden_size=1024, cross_hidden_size=1024, cross_image_pix=1120, cross_image_size=1120, cuda=True, deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_activation_checkpointing=False, deepspeed_config={'train_micro_batch_size_per_gpu': 4, 'gradient_accumulation_steps': 1, 'gradient_clipping': 0.1, 'fp16': {'enabled': False, 'loss_scale': 0, 'loss_scale_window': 200, 'hysteresis': 2, 'min_loss_scale': 0.01}, 'bf16': {'enabled': False}, 'optimizer': {'type': 'AdamW', 'params': {'lr': 0.0001, 'weight_decay': 0.01}}}, deepspeed_mpi=False, device=0, distributed_backend='nccl', drop_path=0.0, eos_token_id=2, epochs=None, eva_args={'model_parallel_size': 1}, eval_batch_size=None, eval_interval=None, eval_iters=100, exit_interval=None, experiment_name='finetune-/mntnlp/common_base_model/cogvqa', fp16=False, from_pretrained='/mntnlp/common_base_model/cogvqa', gradient_accumulation_steps=1, hidden_act='silu', hidden_dropout=0.1, hidden_size=4096, hidden_size_per_attention_head=None, ignore_pad_token_for_loss=True, image_length=256, initializer_range=0.02, inner_hidden_size=None, input_source='interactive', intermediate_size=11008, iterable_dataset=False, layer_range=None, layernorm_epsilon=1e-05, layernorm_order='pre', length_penalty=0.0, load=None, local_rank=0, local_tokenizer='/mntnlp/common_base_model/vicuna_v1.5_7b', log_interval=50, lora_rank=50, lr=0.0001, lr_decay_iters=None, lr_decay_ratio=0.1, lr_decay_style='cosine', make_vocab_size_divisible_by=128, master_ip='127.0.0.1', master_port='16666', max_inference_batch_size=12, max_length=400, max_position_embeddings=2048, max_sequence_length=512, min_tgt_length=0, mode='finetune', model_parallel_size=1, no_load_rng=False, no_repeat_ngram_size=0, no_save_rng=False, num_attention_heads=32, num_beams=1, num_hidden_layers=32, num_layers=6, num_multi_query_heads=0, num_workers=1, out_seq_length=256, output_path='./samples', pad_token_id=0, pre_seq_len=8, prefetch_factor=4, rank=0, resume_dataloader=True, rms_norm_eps=1e-05, save=None, save_args=False, save_interval=5000, seed=1234, skip_init=False, split='1000,1,1', strict_eval=False, summary_dir='', temperature=1.0, template_version='chat', test_data=None, tie_word_embeddings=False, tokenizer_type='fake', top_k=0, top_p=0.0, torch_dtype='bfloat16', train_data=['./archive_split/train'], train_data_weights=None, train_iters=2000, transformers_version='4.36.0.dev0', use_cache=True, use_gpu_initialization=False, use_lora=True, use_ptuning=False, use_qlora=False, valid_data=['./archive_split/valid'], version='chat', vision_config={'dropout_prob': 0.0, 'hidden_act': 'gelu', 'hidden_size': 1792, 'image_size': 224, 'in_channels': 3, 'intermediate_size': 15360, 'layer_norm_eps': 1e-06, 'num_heads': 16, 'num_hidden_layers': 63, 'num_positions': 257, 'patch_size': 14}, vit_checkpoint_activations=False, vocab_size=32000, warmup=0.02, weight_decay=0.01, with_id=False, world_size=1, zero_stage=0). How should inner_hidden_size be set by default? Much appreciated.
-
enable = ["encoder", "cross_attention", "linear_proj", 'mlp.vision', 'rotary.vision', 'eoi', 'boi', 'vit']
-
This may be somewhat of a duplicate question, so here are my questions:
-
I'm trying to fine-tune the CogAgent model on 8×3090 GPUs with MP_SIZE=4, but the 500 GB of host memory is exhausted already during model loading. Is there any way to reduce this memory usage?
-
I ran into the same problem as in #268. Apart from MP_SIZE and NUM_GPUS_PER_WORKER I did not change any other parameters, and I'm fine-tuning from the weights pulled via the official SAT; the failure occurs in
-
I'd like to ask: during fine-tuning, can the parameters of each layer of the backend be unfrozen and trained? If so, how exactly is that done?
-
Following the demo example, after downloading the images and setting the paths I ran bash finetune_cogvlm_lora.sh and got the error below. I'd like to know what is causing it.
-
Hello!
-
Is there currently any code for LoRA or QLoRA fine-tuning of the HF model? And why can't the provided finetune script fine-tune a locally downloaded HF model?
-
Are there plans to open-source CogAgent's pretraining data and the QA-format fine-tuning data?
-
bash finetune_demo/finetune_cogvlm_lora.sh
-
The file saved in the final step of fine-tuning CogVLM is almost twice the size of the base model file, 60+ GB. Is this because of float precision? How can I save a smaller file that can be used directly for inference?
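A plausible explanation is that the checkpoint also carries optimizer state and/or fp32 copies of the weights in addition to the model itself. Below is a minimal sketch (assuming a mp_rank_XX_model_states.pt file with the weights stored under a 'module' key, which may not match your exact layout) of stripping everything but the weights and casting them to bf16 before re-saving:

```python
import torch

# Illustrative only: drop optimizer/rng state if present and halve the
# storage by casting floating-point tensors to bf16. The file name and
# the 'module' key are assumptions; verify against your checkpoint.
ckpt = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
state_dict = ckpt.get("module", ckpt)

half_sd = {
    k: v.to(torch.bfloat16) if torch.is_tensor(v) and v.is_floating_point() else v
    for k, v in state_dict.items()
}
torch.save({"module": half_sd}, "mp_rank_00_model_states_bf16.pt")
```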
-
The model is saved periodically during training. Can this be turned off? How?
-
Problem: running finetune_cogvlm_demo.py on a few images for 100 steps, the loss stays at 0 the whole time and the validation pred text never changes at all. The only training parameters I changed from the defaults are batch_size=1 and MP_num=1. Because there isn't enough VRAM, the trainable parameters do not include LoRA, only the ViT MLP and p-tuning. What could be the problem?
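One hedged guess: a loss that is exactly 0 often means every label token is being masked out (commonly with -100), so the cross-entropy has nothing to supervise. A hypothetical sanity check, assuming the -100 ignore-index convention used by most PyTorch LM training code (verify against your dataset.py):

```python
import torch

IGNORE_INDEX = -100  # assumed convention; check what your dataset actually uses

def check_labels(labels: torch.Tensor) -> None:
    """Print how many label tokens actually contribute to the loss."""
    valid = (labels != IGNORE_INDEX).sum().item()
    total = labels.numel()
    print(f"supervised tokens: {valid}/{total}")
    if valid == 0:
        print("WARNING: no supervised tokens in this batch -> loss will be 0")
```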
-
How can I do another round of LoRA fine-tuning on top of a model that was already LoRA fine-tuned? I have LoRA fine-tuned a model from the base model and saved the CKPT; I want to load this CKPT, attach LoRA layers to it, and then fine-tune on another dataset. How should this be configured?
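For reference, one common route (not necessarily SAT's built-in mechanism) is to first merge the existing LoRA delta into the base weights, then attach a fresh LoRA for the next round. Conceptually the merge is just W' = W + (alpha/r)·B·A; the sketch below only illustrates that math and is not the repository's API:

```python
import torch

# Conceptual sketch only: fold a trained LoRA pair (A, B) back into the
# frozen base weight W, after which a new LoRA can be attached to W'.
def merge_lora_weight(W: torch.Tensor,   # (out_features, in_features)
                      A: torch.Tensor,   # (r, in_features)
                      B: torch.Tensor,   # (out_features, r)
                      alpha: float,
                      r: int) -> torch.Tensor:
    return W + (alpha / r) * (B @ A)
```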
-
Running finetune_cogagent_lora.sh on 8 A100 GPUs runs out of GPU memory...
-
Has anyone tried fine-tuning CogAgent with quant4 quantization? I get a dimension error: File "/root/miniconda3/envs/CogVLM/lib/python3.10/site-packages/sat/model/finetune/lora2.py", line 97, in init
-
Could you explain which modules these entries refer to: enable = ["encoder", "cross_attention", "linear_proj", 'mlp.vision', 'rotary.vision', 'eoi', 'boi', 'vit']? If I want to freeze modules, is it enough to just configure them here?
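For illustration only: an enable list like this is usually treated as a set of name patterns, and everything whose parameter name does not match is frozen. A minimal PyTorch sketch of that idea, assuming substring matching over named_parameters() (the actual matching rule in SAT/CogVLM may differ):

```python
import torch

# Assumed example patterns taken from the question above.
enable = ["encoder", "cross_attention", "linear_proj",
          "mlp.vision", "rotary.vision", "eoi", "boi", "vit"]

def freeze_except(model: torch.nn.Module, enable_patterns) -> None:
    """Keep gradients only for parameters whose names match a pattern."""
    trainable, frozen = 0, 0
    for name, param in model.named_parameters():
        if any(pat in name for pat in enable_patterns):
            param.requires_grad = True
            trainable += param.numel()
        else:
            param.requires_grad = False
            frozen += param.numel()
    print(f"trainable params: {trainable:,} | frozen params: {frozen:,}")
```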
-
How do I continue fine-tuning from a checkpoint that has not been merged?
-
For multi-turn dialogue, how do I concatenate multiple question-answer pairs? How should dataset.py be modified?
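As a rough illustration (not the repository's dataset.py), multi-turn samples are typically built by concatenating each question and answer into one token sequence and masking the question tokens in the labels so only answers are supervised. The template strings, tokenizer calls, and -100 ignore index below are all assumptions to adapt to the prompt format your template_version uses:

```python
IGNORE_INDEX = -100  # assumed ignore index for unsupervised positions

def build_multiturn_sample(tokenizer, turns):
    """turns: list of (question, answer) string pairs.

    Hypothetical helper: concatenate all turns, supervising answers only.
    """
    input_ids, labels = [], []
    for question, answer in turns:
        q_ids = tokenizer.encode(f"Question: {question} Answer:",
                                 add_special_tokens=False)
        a_ids = tokenizer.encode(answer, add_special_tokens=False)
        a_ids = a_ids + [tokenizer.eos_token_id]
        input_ids += q_ids + a_ids
        labels += [IGNORE_INDEX] * len(q_ids) + a_ids
    return {"input_ids": input_ids, "labels": labels}
```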
-
In the finetune cogagent _demo code, the data_collator returns the information below, but the parameters accepted by the CogAgent model's forward method are not these. I'm curious where some processing happens in between?
-
How can I merge a model LoRA fine-tuned with 4-GPU model parallelism into a single model? I loaded the open-source cogvlm-base-490 model, did LoRA fine-tuning with MP_SIZE=4, and saved a CKPT consisting of four files (mp_rank_00_model_states.pt to mp_rank_03_model_states.pt). How do I merge this into an MP_SIZE=1 model stored in a single file? Thanks.
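If SAT provides a repartition/merge utility, that is the safer path. Purely as a conceptual sketch: Megatron-style model parallelism splits column-parallel weights along dim 0 and row-parallel weights along dim 1, so a manual merge concatenates each sharded tensor back along the right dimension. The parameter-name patterns and the 'module' key below are assumptions you would need to verify against the actual checkpoints:

```python
import torch

def concat_dim_for(name: str):
    """Assumed mapping from parameter name to concat dimension."""
    if "dense_h_to_4h" in name or "query_key_value" in name:   # column-parallel (assumed)
        return 0
    if "dense_4h_to_h" in name or "attention.dense" in name:   # row-parallel (assumed)
        return 1
    return None  # replicated parameter: identical on every rank

# Load the four MP shards saved by the fine-tuning run.
shards = [torch.load(f"mp_rank_0{i}_model_states.pt", map_location="cpu")["module"]
          for i in range(4)]

merged = {}
for name in shards[0]:
    dim = concat_dim_for(name)
    if dim is None:
        merged[name] = shards[0][name]
    else:
        merged[name] = torch.cat([s[name] for s in shards], dim=dim)

torch.save({"module": merged}, "merged_mp1_model_states.pt")
```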
-
I have a problem but no idea what's wrong. Here is the log: (cogvlm) dhu_mbzhao_1@deeplearning-v191204-deeplearn: Could someone help me?
-
What does this mean? Does this log message affect performance? Keyword arguments {'add_special_tokens': False} not recognized.
-
My test results are as follows; is this considered good or bad? [2024-07-19 09:38:21,589] [INFO] [RANK 0] validation loss at the end of training for test data | loss: 0.000000E+00 | PPL: 1.000000E+00 acc 5.319865E-02 | acc_w/o_case 5.319865E-02 |
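For context, PPL here is just exp(loss), so a loss of exactly 0 forces PPL = 1; combined with ~5% accuracy this usually suggests the validation loss is being computed over no (or fully ignored) label tokens rather than a genuinely perfect model. The relation itself:

```python
import math

# perplexity = exp(mean cross-entropy loss); loss == 0 implies PPL == 1
print(math.exp(0.0))  # 1.0
```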
-
When fine-tuning CogAgent, I noticed that during eval the pred the model outputs is its answer, but the corresponding label is the question that was asked. Is this normal?
-
When I use a dataset I built myself, the loss is very low after 1,000 iterations, but the actual dialogue quality is poor and incorrect detections appear frequently. Is this a problem with the dataset or with the hyperparameter settings? I'm doing LoRA fine-tuning on the base-490 model.
-
Please post any questions about model fine-tuning here; community members and the official team will answer them in their spare time.