Releases: PaddlePaddle/PaddleNLP
Stable RL v1.0.0
GRPO and RF++ ready
v3.0.0-beta4
In this release we fully integrate DeepSeek R1-style reasoning models, with inference deeply optimized by the inference team for industry-leading speed. We also release our in-house PP-UIE information extraction model. Key updates follow.
Highlights:
- New models
- Inference and deployment
  - Full support for FP8, INT8, and 4-bit quantized inference of the full DeepSeek V3/R1 models, plus MTP speculative decoding.
  - FP8 inference exceeds 1,000 output tokens/s on a single node; 4-bit single-node deployment exceeds 2,100 output tokens/s.
  - First unified inference/deployment image, released jointly with the inference team, for one-click deployment of popular models. The inference and deployment docs were fully refreshed. See the documentation.
- Model training:
  - Added large-model embedding training, with INF-CL support for very large batch sizes.
  - Added the MergeKit model-merging toolkit to mitigate the alignment tax. See the documentation.
  - Low-resource training fully optimized; training runs smoothly on 16 GB GPUs.
- Other notable features:
  - The documentation site now shows the model list; users can browse and download model files. See the documentation.
  - Training adds the adam-mini optimizer, and the AdamW optimizer supports BF16 momentum.
Details of the corresponding updates:
1. Models and framework components
- New models
  - Newly added models:
    - paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B
    - deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base, deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero
    - deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    - Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen2.5-14B-Instruct-1M, Qwen/QwQ-32B, Qwen/QwQ-32B-Preview
  - PR #9738: add the DeepSeek V3 model. PR #9876: add MTP support. PR #9797: fix a TP issue. PR #9643: add notes for the DeepSeek Llama3.3 distilled model (@DrownFish19)
  - PR #9906: DeepSeek V3 supports loading Float8 weights directly for inference in dynamic-graph mode (@ZHUI)
  - PR #9845: add the PP-UIE model family (@Fantasy-02); PR #9911 & PR #9913: PP-UIE documentation updates (@DrownFish19); see the sketch below
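A minimal usage sketch for PP-UIE, assuming it is exposed through the same Taskflow `information_extraction` task as its UIE predecessors; the model identifier follows the release list above, so check the PP-UIE docs for the exact loading convention:

```python
from paddlenlp import Taskflow

# Schema-driven extraction: list the entity types you want pulled out.
schema = ["时间", "选手", "赛事名称"]
# Assumed model id, matching the release list above.
ie = Taskflow("information_extraction", schema=schema, model="paddlenlp/PP-UIE-0.5B")
print(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
```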
- Tokenizer improvements
  - PR #9548, PR #9577, PR #9594: "Hackathon No.43" series, rounding out TokenizerFast support (@yinfan98); see the sketch below
  - PR #9745: fix an AutoTokenizer issue (@DrownFish19). PR #9837: save extra special tokens (@DesmonDay)
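A short sketch of opting into the Rust-backed fast tokenizer; it assumes the HF-style `use_fast` switch that the TokenizerFast work targets, and any model with a fast variant will do:

```python
from paddlenlp.transformers import AutoTokenizer

# use_fast=True selects the TokenizerFast implementation when available;
# models without a fast variant fall back to the Python tokenizer.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-1M", use_fast=True)
print(tok("PaddleNLP v3.0.0-beta4")["input_ids"])
```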
- Unified Checkpoint:
- MergeKit enhancements and optimizations
  - New features and optimizations
    - PR #9561: add mergekit_with_sparsify to support sparsified merging (@Mangodadada).
    - PR #9702: improve MergeKit's GPU support for faster processing (@Mangodadada).
    - PR #9811: add LoRA (low-rank adapter) merging, extending model-fusion capability (@lugimzzz).
  - Tooling updates and maintenance
    - PR #9885: code maintenance for the MergeKit tool, streamlining its overall logic.
- Logging and debugging support
  - New features and optimizations
- Low-resource optimizations
  - PR #9804: add use_fused_linear_cross_entropy support to reduce GPU memory; add pre_divided_factor to avoid FP16 overflow.
- Documentation updates and miscellaneous:
2. LLM training updates
- General training
  - PR #9204: update tensor/pipeline parallelism for chatglmv2 (@DrownFish19)
  - PR #9827: add pipeline and flashmask support for Qwen2Moe and Deepseek (@DrownFish19)
- Embedding training
  - PR #9508: add the embedding trainer (@DesmonDay). PR #9673: add INF-CL support for very large batch training (@jie-z-0607); see the sketch after this list
  - PR #9656: fix loading the rng state in the Trainer (@DesmonDay)
  - PR #9721: fix embedding randomness (@DesmonDay)
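For intuition, a standalone sketch of the tiling idea behind INF-CL-style contrastive training: the InfoNCE loss is evaluated over row-chunks of the similarity matrix so the full batch-by-batch logits never materialize at once. This is illustrative only, not the released trainer integration:

```python
import paddle
import paddle.nn.functional as F

def chunked_infonce(q, k, temperature=0.05, chunk=1024):
    # q, k: (B, d) paired query/key embeddings; in-batch negatives.
    q = F.normalize(q, axis=-1)
    k = F.normalize(k, axis=-1)
    losses = []
    for start in range(0, q.shape[0], chunk):
        rows = q[start:start + chunk]                      # (c, d)
        logits = paddle.matmul(rows, k, transpose_y=True)  # (c, B) tile
        labels = paddle.arange(start, start + rows.shape[0])
        losses.append(F.cross_entropy(logits / temperature, labels, reduction="sum"))
    return paddle.add_n(losses) / q.shape[0]

q, k = paddle.randn([4096, 256]), paddle.randn([4096, 256])
print(chunked_infonce(q, k).item())
```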
- DPO training
- New features
  - PR #9542: add adam-mini optimizer support (@lugimzzz); see the sketch after this list
  - PR #9732: support AdamW training with BF16 momentum (@lugimzzz)
  - PR #9830: fix checkpoint saving in non-flash mode (@SylarTiaNII)
  - PR #9705: cherry-pick: validate the loss before the optimizer step (@SylarTiaNII)
  - PR #9704: cherry-pick: add an asynchronous metrics dumper for LLM training (@SylarTiaNII)
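A sketch of selecting the new optimizer through the trainer arguments; the exact `optim` value for adam-mini is an assumption here (PR #9542 defines the real identifier):

```python
from paddlenlp.trainer import TrainingArguments

args = TrainingArguments(
    output_dir="./ckpt",
    optim="adam_mini",  # assumed flag value; check PR #9542 / the docs
    bf16=True,          # pairs with the new BF16-momentum support
)
```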
- Training documentation and fixes
3. Inference updates
- Predictor & Flask updates
  - PR #9831: fix multibatch inference (@DrownFish19)
  - PR #9841: fix a position_ids issue (@DrownFish19)
  - PR #9864: update Deepseek inference (@DrownFish19)
  - PR #9828: make the Flask inference service compatible with the OpenAI API (@ZHUI); see the sketch below
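A client-side sketch against the OpenAI-compatible service from PR #9828; the host, port, and model id are placeholders, and the service may ignore the api_key even though the client requires one:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8011/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder model id
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```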
- MTP improvements
  - PR #9856: support MTP with Deepseek-v3 in inference (@freeliuzc)
  - PR #9894: fix MTP for Deepseek_v3 in multi-GPU mode (@freeliuzc)
  - PR #9936: add MTP serving support (@freeliuzc)
- Deployment
  - PR #9872: support multi-node LLM deployment (@ltd0924)
  - PR #9791: merge parts of the fastdeploy code (@kevincheng2)
- Kernel optimizations
- Documentation and tests
  - PR #9613: support llama3.2 in the inference module and update docs (@yuanlehome)
  - PR #9921: fix the llama block_size setting (@zhaohaixu)
  - PR #9711: add common-model and parameter unit tests for the LLM predictor (@aooxin)
4. AutoParallel / distributed training updates
- Auto parallelism
  - PR #9578: add a llama2-7b-cinn test (@zhangbo9674)
- Base configuration and CI integration
  - PR #9538: add qwen model_auto and CI (@blacksheep-Aristotle)
  - PR #9541: add a llama3.1 auto-parallel configuration (@zhiqiu)
  - PR #9551: add auto CI support for gpt and baichuan (@blacksheep-Aristotle)
  - PR #9591: add CE support for gpt, baichuan, and qwen (@blacksheep-Aristotle)
  - PR #9412: add the single_model network and use the intermediate API (@blacksheep-Aristotle)
  - PR #9943: control split input via training_args (@blacksheep-Aristotle)
- Tests, validation, and feature switches
  - PR #9621: add a PIR recompute test (@waliwali777)
  - PR #9647: adjust loss_base to support SPMD after dropout (@deepllz)
  - PR #9714: add switches for stage-1 tensor fusion (@AndSonder)
  - PR #9672: fix the recompute test under to_static=1 (@waliwali777)
  - PR #9688: merge checkpoints for inference under auto parallelism (@xuxinyi389)
  - PR #9750 & PR #9753: fix ernie auto trainer CI errors (@blacksheep-Aristotle)
  - PR #9749: enable tensor fusion for benchmarks (@AndSonder)
  - PR #9810: add a save/load switch for sharding tensor fusion (@AndSonder)
  - PR #9862: support DP/MP for deepseekv2 (@xuxinyi389)
  - PR #9823: add PPO checkpoint support (@xuxinyi389)
5. CI, documentation, benchmark, and test-script updates
- CI scripts and warning filters
  - PR #9547: update CI scripts (@Liujie0926)
  - PR #9612: filter paddle.to_tensor warnings in CI (@DrownFish19)
  - PR #9626: update the a100 loss_base configuration (@Liujie0926)
  - PR #9889: CI script updates (@Liujie0926)
  - PR #9524: add qwen2.5-7b to the LLM benchmark (@Liujie0926)
  - PR #9662 & PR #9722: update the LLM_benchmark scripts (@Liujie0926)
- Documentation improvements
  - PR #9585: fix broken links in the docs (@DrownFish19)
  - PR #9668: update README.md (@ZHUI)
  - PR #9785: update the docs-facing README (@ZHUI)
  - PR #9746: documentation fixes (@DrownFish19)
  - PR #9725: adjust benchmark environment variables and model configurations (@XieYunshen)
  - PR #9877: correct the inference and serving docs (@ZHUI)
  - PR #9834: publish DeepSeek news and notes (@DrownFish19)
  - PR #9922: correct errors in the fine-tuning docs (@sijunhe)
- Benchmark configuration and tests
  - PR #9651: fix abnormal exits in multi-node benchmark jobs (@XieYunshen)
  - PR #9891: update the best configuration for gpt-13b in dygraph mode (@liym27)
6. NPU/XPU and hardware updates
- NPU adaptation and fixes
  - PR #9499: adapt FusedHeadAndCrossEntropy for NPU (@tianhaodongbd)
  - PR #9573: fix where on NPU (@tianhaodongbd)
  - PR #9762: adapt to the new flash_attention_npu API (@will-jl944)
- XPU features and optimizations
  - PR #9549: qwen2 supports flash_attn on XPU (@will-jl944)
  - PR #9660: qwen2 supports fused_rope (@will-jl944)
  - PR #9789: support empty_cache on XPU (@will-jl944)
  - PR #9796: support XPU for auto-parallel LLaMa (@From00)
  - PR #9854: add XPU fused ops for deepseek (@QingshuChen)
7. Bug fixes, performance optimizations, and other improvements
- State loading and multi-threading
  - PR #9464: fix load_state_dict under multi-threading (@DesmonDay)
- Model and operator fixes
  - PR #9603: fix a d2s bug in qwen2 modeling (@wawltor)
  - PR #9569: fix norm outputs between dynamic and static modes (@Wangzheee)
  - PR #9652: fix paddle.where (@will-jl944)
  - PR #9638: add the config option replace_with_c_embedding (@Xing-lil)
  - PR #9699: fix a loraga AMP issue (@greycooker)
  - PR #9752: fix a bug in get_block_shape_and_split_kv_block (@lizhenyun01)
  - PR #9759: fix the speculate_verify_and_update op (@Wanglongzhi2001)
  - PR #9674: merge speculate_step into the step op (@Wanglongzhi2001)
  - PR #9757: update sequence parallel in the Trainer module (@DesmonDay)
  - PR #9765: fix a loraga merge issue (@greycooker)
  - PR #9777: cherry-pick fuse-optimizer support for distributed training (@SylarTiaNII)
  - PR #9783: fix a CE error (@blacksheep-Aristotle)
  - PR #9779: fix an unsafe pickle load (@DrownFish19)
  - PR #9760: fix expert parallel in the MoE module (@DesmonDay)
  - PR #9790: add a pir_model path for server inference (@aooxin)
  - PR #9706: cherry-pick PDC SDK integration for LLM training fault tolerance (@SylarTiaNII)
  - PR #9624: add FLAGS to replace four parameters for better speedups (@zhink)
  - PR #9806: fix a LLAMA argument-parsing bug (@will-jl944)
  - PR #9829: update mixtral.md (@yuanlehome)
  - PR #9859: fix a dsk rope discrepancy (@yuanlehome)
8. Environment, dependency, and version-compatibility updates
- Requirements and installation
  - PR #9514: update requirements.txt for py38 (@ZHUI)
  - PR #9118: update installation dependencies (@DrownFish19)
  - PR #9953: add the tokenizers dependency for py38 (@DrownFish19)
- Python version compatibility
What's Changed
- Update requirements.txt for py38 by @ZHUI in #9514
- [Unified Checkpoint] fix single card loading without master weights by @DesmonDay in #9540
- Fix multi-threading load_state_dict by @DesmonDay in #9464
- delete generate_rank_mapping when export multi cards model by @yuanlehome in #9552
- [LLM] dpo support qwen2 with flashmask by @wtmlon in #9543
- [XPU] qwen2 supports flash_attn on XPU by @will-jl944 in #9549
- [AutoParallel]: add qwen model_auto and ci by @blacksheep-Aristotle in #9538
- add llama3.1 config for auto_parallel by @zhiqiu in #9541
- Add more model support for speculate_decoding and refactor speculate_decoding by @Wanglongzhi2001 in #9504
- [Intel_HPU]FSDPA custom kernel API update by @yanfeich in #9556
- [Unified Checkpoint] fix load missing keys by @DesmonDay in #9523
- [Hackathon 7th No.43] Complete TokenizerFast support, part 3 by @yinfan98 in https...
v3.0.0-beta3
This release improves PaddleNLP's core experience: it adds the Llama-3.2 and DeepSeekV2 models, upgrades TokenizerFast, and refactors SFTTrainer.
PaddleNLP now also supports offloading and reloading optimizer state and fine-grained recomputation, improving training performance by 7%. Unified Checkpoint gains further-optimized asynchronous saving plus a new checkpoint compression feature that saves up to 78.5% of storage.
Finally, LLM inference, auto parallelism, multi-hardware support, and the documentation were all deeply optimized.
Key updates and enhancements
- New models:
- Infrastructure improvements:
- Inference performance:
- Hardware compatibility:
- Auto-parallel optimizations:
- Documentation and test updates:
This release marks PaddleNLP's continued progress toward a more complete, efficient, and stable NLP solution, and we look forward to bringing more innovation and value in future versions.
What's Changed
- [Unified Checkpoint] update async_save_info in develop by @DesmonDay in #9173
- add flashmask rm by @lugimzzz in #9154
- [LLM_INFER] Support quantized model from bos and fix docs by @yuanlehome in #9197
- fix ci not set no_proxy and modify tests in pir mode by @fightfat in #9205
- [Models] Add Llama-3.2 by @DrownFish19 in #9199
- move some auto_parallel args into class AutoTrainingArguments by @Wennie396 in #9155
- [Performance] Compatible with flashmask API rename upgrade by @GuoxiaWang in #9019
- [AutoParallel] add vpp align and pp amp test by @AndSonder in #9176
- fix auto ci return bug when run in v100 by @fightfat in #9216
- fix auto ci return bug when run in v100 by @AndSonder in #9228
- [LLM] Add tools for parameters by @Hanyonggong in #9137
- [AutoParallel] Add test for fuse_ffn and fuse_attention_qkv pass by @zhangbo9674 in #9203
- [CI] Fix ci import. by @ZHUI in #9239
- [Version] Update version info by @DrownFish19 in #9241
- [Auto Parallel] Adding align mode support by @zhangyuqin1998 in #9150
- [LLM INFER] top_p_sampling_reject support top_p=0 and custom seed by @gzy19990617 in #9202
- [INFER] update tune_cublaslt_gemm op and fix some bugs by @yuanlehome in #9222
- Reduce the time spent on git downloading third-party libraries by @vivienfanghuagood in #9246
- [PIR] fix pir open bugs by @yuanlehome in #9248
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @sneaxiy in #9245
- [Unified Checkpoint] Support expert parallel by @DesmonDay in #9055
- [PIR] fix pir dt2st for chatglm_v2 by @yuanlehome in #9251
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @LiYuRio in #9253
- [Unified Checkpoint] Fix generation config save by @DrownFish19 in #9223
- [AutoParallel] Fix tests for pass paddle AutoParallel CI by @liym27 in #9267
- change dataset by @lugimzzz in #9266
- [Unified Checkpoint] update async save logic by @DesmonDay in #9274
- add config file for model chatglm2,gemma,yuan by @Mangodadada in #9139
- Fix async hang by @DesmonDay in #9276
- [AutoParallel] Change llama test from sharding stage2 to stage1 by @zhangbo9674 in #9281
- [Tokenizer] Enable padding_side as call time kwargs by @DrownFish19 in #9258
- [Trainer] fix save_model by @DesmonDay in #9286
- [CI] Skip inference test cases by @DrownFish19 in #9270
- [LLM] Add deepseekv2 by @DrownFish19 in #9250
- [Tokenizer] Unify tokenizer _pad by @DrownFish19 in #9280
- [CI] Fix llm/alignment/rm/flashmask path by @DrownFish19 in #9289
- support attention mask using causal=True by @GuoxiaWang in #9268
- [FlashMask] Add FlashMask for Qwen2 by @DrownFish19 in #9264
- bug fix for xpu_parallel_matmul by @FeixLiu in #9297
- fix lora sharding v2 by @lugimzzz in #9300
- [LLM INFER] Append attn by @yuanlehome in #9244
- [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bu… by @zhangyuqin1998 in #9217
- [Tokenizer] Fix TokenizerFast missing clean_up_tokenization_spaces by @dynamicheart in #9304
- clean llama static modeling file by @zhiqiu in #9301
- [Unified Checkpoint] Accelerate loading checkpoint by multi-thread by @Crystal-X-111 in #9034
- fix non-pipelinelayer to distributed by @gongel in #9310
- change the legacy to slm by @wawltor in #9311
- [TRL] Rename sft trainer. by @ZHUI in #9292
- [XPU] support unified ckpt function by @cqulilujia in #9312
- [LLM INFER] Fix some bugs and chatglm_v2 support block_attn by @yuanlehome in #9271
- [Readme] Add flash mask by @lugimzzz in #9219
- update llm infer docs by @yuanlehome in #9314
- [Unified Checkpoint] Add split param and refactor code by @DesmonDay in #9240
- [METAX] Support llama for MX C550 by @idontkonwher in #9186
- update QR code by @DrownFish19 in #9325
- add flash_attention on model chatglm_v2 by @Mangodadada in #9296
- fix readme by @Mangodadada in #9326
- [Unified Checkpoint] update non-merge checkpoint loading, move async_save_info.json location by @DesmonDay in #9321
- [paddle cpu inference]fix cpu doc by @bukejiyu in #9299
- [LLM INFER] add rope_theta for block_multihead_attention by @yuanlehome in #9334
- Fix pr 9334 by @yuanlehome in #9335
- fix parameter calculation in auto_parallel mode by @zhiqiu in #9327
- [Docs] Update flashmask by @DrownFish19 in #9330
- Update load_save_single_card.py by @DesmonDay in #9337
- Update README.md by @DrownFish19 in #9339
- [Tokenizer] Support reading Tiktoken tokenizer.model. by @lvdongyi in #9215
- align default custom black/white list for dygraph and static graph by @zhiqiu in #9340
- [intel_hpu] initial commit for intel_hpu support by @yanfeich in #9273
- Compatible with Tensor.to change to out_of_place. by @DrownFish19 in https://github.co...
v3.0.0-beta2
This release strengthens PaddleNLP's infrastructure: it adds the Qwen2.5 and Mixtral 8*22B models, upgrades the Tokenizer, and renames the data-indexing tools.
It also fixes issues such as saving and loading MoE model parameters, improves text-processing accuracy, and updates docs and test cases. Inference performance, hardware support, and auto parallelism were optimized as well, including broader model and parameter coverage, multi-GPU inference, stronger domestic-hardware support, and a streamlined distributed training workflow.
Core changes and enhancements
- Infrastructure hardening:
- Bug fixes:
- Documentation and test updates:
- Other key changes:
  - Inference performance optimizations:
  - Hardware support expansion:
  - Auto-parallel optimizations:
What's Changed
- [Unified checkpoint] update optimizer async save signal by @DesmonDay in #8975
- Fix the run_dpo.py file path by @Mangodadada in #8952
- fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by @winter-wang in #8986
- [Bug fix] fix skip consumed_samples twice bug by @zhangyuqin1998 in #8980
- fix pip error in legacy benchmarks by @fightfat in #8978
- 【auto_parallel】Add checkpoint convertor by @xingmingyyj in #8847
- [llm]update finetune.md by @lugimzzz in #8990
- tool_helpers upgrade supports up to 32,766 datasets by @JunnYu in #8994
- add DCU inference docs by @YanhuiDua in #8983
- [Distributed]Add loss nan/inf checker by @ForFishes in #8943
- 【llm】update docs by @lugimzzz in #8999
- [Feature] Fused Mixtral support by @penPenf28 in #8901
- [XPU] Add README.md for llama2-7b by @xiguapipi in #8979
- Add gcu llama readme by @EnflameGCU in #8950
- fix qwen model use_casual_mask by @deepllz in #9009
- [ZeroPadding] revert zero_padding #8973 by @DrownFish19 in #9003
- [LLM Inference] Fix step.cu bug by @yuanlehome in #8995
- Refine checkpoint converter by @zhangbo9674 in #9001
- [Feature] fused mixtral wint4 by @penPenf28 in #9013
- llm inference docs by @Sunny-bot1 in #8976
- [LLM Inference] Support Qwen2_Moe Inference Model by @CJ77Qi in #8892
- fix llama3 static run by @yuanlehome in #8849
- [paddle inference cpu]update cpu inference by @bukejiyu in #8984
- fix the tipc ce case by @wawltor in #8748
- [Cherry-pick] Add is_distributed field in sharding reshard param_meta by @sneaxiy in #9028
- [Tokenizer] Support for loading added_tokens_decoder by @DrownFish19 in #8997
- [Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by @lixcli in #9032
- Fix checker of nan/inf by @ForFishes in #9029
- [Cherry-pick] add comm buffer size (#8963) by @ForFishes in #9031
- [Unified Checkpoint] Update async save info by @DesmonDay in #8982
- [llm]support pad to max_length & fix sp bug by @lugimzzz in #9040
- [Bugfix] fix bias optional by @penPenf28 in #9037
- fix setup.py for llm inference by @yuanlehome in #9041
- [Inference] Add cutlass gemm dequant op by @gzy19990617 in #8909
- [Inference] update fakequant support by @lixcli in #9047
- add test for pir sequence parallel on llama model by @liym27 in #9015
- Fix moe save load by @Meiyim in #9045
- Update quantization.md by @ZHUI in #9057
- 【Fix】Initialize dp degree in single GPU by @greycooker in #9056
- fix bos download by @westfish in #9023
- [Inference] Update fakequant script by @lixcli in #9054
- [AutoParallel][PIR] Fit pir grad merge by @AndSonder in #8985
- [MLU] Support rms_norm_mlu by @PeiyuLau in #8504
- [Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by @ckl117 in #8953
- [Inference] Qwen2 support fp8 inference by @ckl117 in #8954
- [Version] update version info by @DrownFish19 in #9060
- [NPU] Fix baichuan2-13b-chat infer by @ronny1996 in #9070
- [MLU] Fix Llama attrntion_mask in npu and mlu by @DrownFish19 in #9075
- Fix the memory overflow bug of the tune_cublaslt_gemm operator by @Hanyonggong in #9076
- [Inference] Fix weight_only_int4 bug by @lixcli in #9073
- [Auto Parallel] fix data stream bug of dist.to_static by @zhangyuqin1998 in #9077
- fix hang when Flag_dataloader_use_file_descriptor=True by @deepllz in #9080
- fix llm predict install error by @fightfat in #9088
- [PIR] add pir grad merge test by @AndSonder in #9074
- Update readme by @EnflameGCU in #9046
- [LLM] Add tensor parallel for chatglmv2 by @SevenSamon in #9014
- [data] update tool_helpers version and add unittest by @JunnYu in #9093
- fix baseline because of PR#8769 by @fightfat in #9092
- fix use paddle.incubate.jit.inference(model) errors by @chang-wenbin in #9016
- [CI] Fix paddlepaddle install by @DesmonDay in #9102
- [LLM] fix train on npu by @SylarTiaNII in #9101
- Disable ut by @zhangbo9674 in #9108
- [AutoParallel] Enable CI for gradclip by @JZ-LIANG in #9059
- [Inference] Remove ceval from run_finetune by @lixcli in #9100
- [Bugfix] fix multi-gpu infer by @penPenf28 in #9107
- 【Inference】fix step kernel by @gzy19990617 in #9122
- [DCU] fix DCU w8a8c8 GEMM shape by @YanhuiDua in #9115
- [Inference] FP8 gemm auto-tune by @ckl117 in #9094
- Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by @zhangbo9674 in #9120
- [LLM Inference] Support Qwen2_Moe Inference with MultiGPU by @CJ77Qi in #9121
- [Unified Checkpoint] Fix uc lora config, fix release_grads by @DesmonDay in #9082
- [Inference]qwen2-a8w8c8 support use_fake_parameter by @ckl117 in #9109
- Add fast_ln spmd rules by @From00 in #9125
- fix pir dtype by @wanghuancoder in #9130
- Remove ring_flash_attention warning by @DrownFish19 in #9119
- [DOC] Fix LLM page 404 Not Found by @DrRyanHuang in #9127
- Add hardware flops for pretraining by @ZHUI in #9069
- [Benchmark] Fix amp level bug in some gpt tests by @zhangbo9674 in #9116
- [Auto Parallel] Fix ckpt_converter for auto_parallel by...
v3.0.0-beta1
PaddleNLP v3.0.0-beta1 brings several important updates over v3.0.0-beta0. It introduces the Yuan, mamba, and jamba models and optimizes the LLM inference code for better compatibility and efficiency.
On the performance side, it adds a fast tokenizer, implements MoE optimizer-parameter broadcast, and accelerates layer normalization. Several bugs were fixed, including a safetensors shape-slicing issue and an mmap issue on Windows, improving stability and compatibility.
Documentation and tests were comprehensively updated for accuracy and readability. Domestic-hardware support also grows, with DCU and XPU optimizations plus updated PIR-mode and auto-parallel configurations.
Key changes and new features
1. New models and features
- New models: #8654 introduced the Yuan model; #8513 and #8517 added the mamba and jamba models, with follow-up PRs fixing related bugs to keep them running stably.
- LLM inference optimization: multiple PRs optimized the LLM inference code and added support for new models and parameters, further improving efficiency and compatibility.
2. Core performance optimizations
- Fast tokenizer: #8832 added a fast tokenizer built on the `tokenizers` library, significantly improving tokenization speed and performance.
- MoE optimization: #8810 implemented broadcasting of MoE (Mixture of Experts) optimizer parameters, improving training efficiency.
- Layer-norm acceleration: several PRs added fast_rmsnorm, enabled use_fast_layer_norm, and updated benchmark configurations to further speed up training; notably, #8717 supports use_fast_layer_norm during fine-tuning for extra flexibility.
- Training performance: #8803 added the `enable_sp_async_reduce_scatter` option to optimize training performance.
- Dict arguments: #8446 let the trainer's argparser accept dict-typed arguments, making parameter passing more flexible; #8904 updated the tensorboard requirement for compatibility with the latest version.
3. Bug fixes
- safetensors: #8702 fixed a safetensors shape issue.
- Windows mmap: #8734 fixed an mmap issue, improving Windows compatibility.
- Other fixes: bug fixes in #8687, #8730, and other PRs.
4. Documentation and test updates
- Documentation: multiple PRs updated docs, cleaned up code style, and refreshed version info for accuracy and readability.
- README: #8741 fixed broken links in the README, and several contributors updated the README and added new test cases to keep docs and code in sync.
5. Other important changes
Domestic hardware support
- DCU: #8580 implemented high-performance LLM training and inference for DCU, widening PaddleNLP's hardware coverage.
- XPU: #8527 added LoRA optimizations for XPU; #8697 and #8710 respectively implemented XPU allgather and fixed a unified-checkpoint gather issue, further improving training efficiency on XPU.
PIR mode support
- Export and loading: #8689 changed how the llama model is exported in PIR mode; #8712 and #8766 added support for loading or saving Llama2-7b in three formats (legacy IR, PIR model file, PIR JSON file), giving users more flexibility and compatibility.
Auto-parallel optimization
- Configuration updates: #8679 changed `max_steps` in the Llama2-7b config to suit auto parallelism; #8767 and #8828 improved the auto trainer's save and load; #8750 updated the loss function for global clipping, further improving auto-parallel efficiency and accuracy.
What's Changed
- [DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel] Change `max_steps` in Llama2-7b config for auto-parallel by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to keep prefetch within the dataset length by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in `llm/predictor.py` by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug with issues code 8641. by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix the fast_ln op error when dynamic semi-auto parallelism is enabled by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @...
v3.0.0-beta0
We are pleased to announce v3.0.0-beta0 of the PaddlePaddle LLM toolkit: embrace large models with a fully upgraded experience. The main work:
- Unified the LLM toolchain, with end-to-end integration of domestic compute chips;
- Full support for industrial LLM workflows: Paddle's 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference;
- In-house RsLoRA+ for extremely fast convergence, the Unified Checkpoint auto-scaling storage mechanism, and generalized FastFFN and FusedQKV support to boost LLM training and inference;
- Continued support and updates for mainstream models, with efficient solutions.
LLM fine-tuning, alignment, training, and inference optimizations
- PEFT:
- DPO:
- Domestic chip support:
- Performance optimization:
- Other
  - Added model memory monitoring in #8269
New models
- Added the Gemma model in #8082
- google/gemma-7b
- google/gemma-7b-it
- google/gemma-2b
- google/gemma-2b-it
- Added the Llama3 model in #8307
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
- Added Qwen2 models in #8338 #8584 #8601
- Qwen/Qwen1.5-0.5B
- Qwen/Qwen1.5-0.5B-Chat
- Qwen/Qwen1.5-1.8B
- Qwen/Qwen1.5-1.8B-Chat
- Qwen/Qwen1.5-4B
- Qwen/Qwen1.5-4B-Chat
- Qwen/Qwen1.5-7B
- Qwen/Qwen1.5-7B-Chat
- Qwen/Qwen1.5-14B
- Qwen/Qwen1.5-14B-Chat
- Qwen/Qwen1.5-32B
- Qwen/Qwen1.5-32B-Chat
- Qwen/Qwen1.5-72B
- Qwen/Qwen1.5-72B-Chat
- Qwen/Qwen1.5-110B
- Qwen/Qwen1.5-110B-Chat
- Qwen/Qwen1.5-MoE-A2.7B
- Qwen/Qwen1.5-MoE-A2.7B-Chat
- Qwen/Qwen2-0.5B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2-72B
- Qwen/Qwen2-72B-Instruct
- Qwen/Qwen2-57B-A14B
- Qwen/Qwen2-57B-A14B-Instruct
Framework upgrades
- Feature optimizations:
- AutoParallel optimizations
- Distributed capability optimizations:
- Chat capability optimizations:
  - Added chat templates in #8226; see the sketch below
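A sketch of rendering a prompt with a chat template; it assumes the `apply_chat_template` interface that #8226 introduces, with an HF-style `tokenize=False` returning the rendered string, and the model choice is illustrative:

```python
from paddlenlp.transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# Render the user query into the model's chat prompt format without tokenizing.
prompt = tok.apply_chat_template("Write a haiku about spring.", tokenize=False)
print(prompt)
```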
- Other
Bug fixes
- Fixed a bug when the sharding degree is under 100 in #8146
- Fixed TP/PP parameter merging in #8239
- Fixed inconsistency between tensor.shape and paddle.shape(tensor) in #8260
- Fixed an fp16 + delay_scale_loss_scale + sharding_stage1_overlap bug in #8314
- Added pipelines run documentation and hints in #8292 #8308 #8202 #8353
- Fixed the tokenizer input for the text feature extraction task in #8331
- Fixed import errors in #8332 #8367
Restructuring
PaddleNLP file structure adjusted in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666
What's Changed
- [dist]pip requirements-dev.txt by @Liujie0926 in #8258
- add scaling by @lugimzzz in #8256
- [LLM]Support Gemma model by @Southpika in #8082
- [BugFix] Try except sequence parallel utils by @DesmonDay in #8189
- Update CodeCov GitHub Action by @sijunhe in #8268
- [AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
- Fix sharding < 100 limitation bug by @sneaxiy in #8146
- use tensor.shape bug not paddle.shape(tensor) by @wanghuancoder in #8260
- [dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
- [Bug Fix]Fix merge parameters in pp by @Southpika in #8239
- [LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
- Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
- add a100 test ground truth by @zhiqiu in #8249
- [paddle-pipelines] faq semantic search question answering readme by @w5688414 in #8292
- [paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
- Support llama-3 by @ZHUI in #8307
- [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
- fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
- [paddle-pipelines] Update mkdocs by @w5688414 in #8310
- [benchmark]update llama2_ips by @Liujie0926 in #8322
- [dist CI]fix before_hook by @Liujie0926 in #8283
- benchmark llama worker=1 by @wanghuancoder in #8305
- 【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
- Add system env log for llama test by @zhangbo9674 in #8321
- [LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
- [Distributed] fix lora by @SylarTiaNII in #8325
- fix try import by @w5688414 in https://github.com/PaddlePaddle/Pa...
v2.8.1
What's Changed
- [Trainer] Fix sharding overlap bug by @DesmonDay in #8334
- [Cherry-pick] update truncate by @KB-Ding in #8375
- [BugFix] Fix llama3 `eot_id` by @ZHUI in #8373
- [Trainer] update distributed dataloader by @DesmonDay in #8426
- [BugFix] Fix load rng compatibility. by @ZHUI in #8451
- Cherry pick/fast_safe_open by @ZHUI in #8458
- 【cherry pick】adapter new type promotion rule for Paddle 2.6 by @zxcd in #8463
- Quick fix from pretrained. by @ZHUI in #8487
- Release/2.8 by @Galaxy1458 in #8437
- Fix from_pretrained `os.path.split` by @DesmonDay in #8508
- [fea] Cherry-picked MOE updates from develop by @bo-ke in #8531
- [LLM] relocate tensor_parallel_output to avoid conflict (#8419) by @DesmonDay in #8533
- Update sequence_parallel for predict by @DesmonDay in #8547
- Cp/fix by @ZHUI in #8569
- Do not save moe_group by @DesmonDay in #8570
- [Release] 2.8.1 by @ZHUI in #8636
Full Changelog: v2.8.0...v2.8.1
v2.8.0
We are pleased to announce v2.8.0 of the PaddlePaddle LLM toolkit. This release deeply optimizes fine-tuning and alignment for large models and improves training and inference on domestic compute hardware:
- Specialized fine-tuning and efficient alignment: the in-house, fast-converging RsLoRA+ algorithm greatly speeds up PEFT training convergence and improves results; high-performance generation acceleration lands in the RLHF PPO algorithm, removing the generation bottleneck in PPO training and putting PPO training performance well in the lead.
- Faster LLM training: generalized support for FastFFN, FusedQKV, and other training optimizations makes LLM training faster and more stable.
LLM fine-tuning, alignment, training, and inference optimizations
- Fine-tuning
- Inference
  - Added static-graph inference for QWenVL #7808
New models
- Added the Deberta and DebertaV2 models #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
- Added Mixtral-of-Experts #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- Added Llama3 #8315
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
Framework upgrades
- Trainer upgrades
- AutoParallel upgrades
- Other
Other support
- Added a matryoshka representation learning retrieval strategy, saving compute and storage; see the sketch below. #8165
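A conceptual sketch of matryoshka-style retrieval: embeddings trained this way stay useful when truncated, so retrieval can score with only the first k dimensions to cut compute and storage. Random data, for illustration only:

```python
import numpy as np

def truncate_and_renormalize(emb, k):
    # Keep the first k dimensions, then re-normalize for cosine scoring.
    cut = emb[..., :k]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

docs = np.random.randn(1000, 768).astype("float32")   # corpus embeddings
query = np.random.randn(768).astype("float32")
d64 = truncate_and_renormalize(docs, 64)
q64 = truncate_and_renormalize(query, 64)
print((d64 @ q64).argmax())  # best match in the 64-dim prefix space
```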
Bug fixes
- Adjusted log levels and added a timelog timer, compatible across devices. #8261
- Fixed inconsistent randomly initialized shared weights in pipeline parallelism, covering GPT/OPT and other models. #7772
- Disabled downloading from the huggingface hub in CI and unit tests #7798 #8198
- Fixed the llm gradio duplicating query and history when the chat template is enabled. #7992
- Fixed a GPT model download key error. #8253
- Fixed LlamaRotaryEmbedding #7882
- Fixed an allreduce dtype issue #7876
- Fixed cleanup of the paddle.jit.dy2static.utils_helper API on the framework dev branch #7989
- Fixed the read-data timer when ignore_data_skip=False and skip_profile_timer=False. #8177
- Fixed Wandb unit tests #8066 #8056
- Fixed an error when the Trainer parses JSON and command-line list arguments at the same time #7860
- Fixed inference issues in the Gradio UI #7740 #7788
- Fixed basic Tokenizer issues #7797 #7870
- Fixed loading the rng state on custom devices. #7894
- Fixed garbled BF16 loss printing under auto parallelism #7874
- Initialize models in float to fix static-graph auto-parallel AMP errors #8033 #8199
- Fixed misuse of the ShardDataloader interface under pipeline parallelism #8014
- Fixed llama precision issues on custom devices. #7895
- Fixed an NPU AICPU operator issue #7976
- Fixed FusedLinearWithGradAdd missing arguments. #8178
What's Changed
- [Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
- [AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
- Add codecov check by @zjjlivein in #7760
- [CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
- [DOC] Update trainer.md by @ZHUI in #7761
- [Release] Change version to 2.7.0 by @ZHUI in #7764
- [benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
- [Release] Update release.yml to release tags by @ZHUI in #7765
- [AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
- [New Features] support dynamic src_length by @wj-Mcat in #7740
- Fix unified_checkpoint bug by @DrownFish19 in #7770
- [DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
- [Trainer] Fix dist dataloader eval by @DesmonDay in #7777
- [Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
- [PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
- [Paddle-Pipelines] update faiss by @qingzhong1 in #7793
- Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
- [tests] download slow by @JunnYu in #7798
- [INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
- Add CE for Distributed Hybrid Parallel by @iosmers in #7782
- add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7806
- pipeline parallel benchmark by @zhangting2020 in #7759
- [Bug fixes] fix br gradio by @wj-Mcat in #7788
- delete useless code for write_cache_kv.cu by @yuanlehome in #7812
- [llm]support qlora pp by @lugimzzz in #7801
- Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
- [LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
- [Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
- [llm]fix lora by @lugimzzz in #7824
- fused rms spmd by @liuzhenhai93 in #7830
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7827
- [neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
- [neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
- [Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
- [semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
- [faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
- [text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
- [LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
- Support 5.2 bloom by @zhoutianzi666 in #7846
- [unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
- [unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
- [New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
- [Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
- [CI] add ci approval pipelines by @zjjlivein in #7859
- [fix] fix a bug of trainer/argparser.py by @greycooker in #7860
- [Improvement] fix ops improting in utils by @wj-Mcat in #7865
- [Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
- [Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
- Add PPO training. by @guoshengCS in #7305
- Update reward_main.py by @wawltor in #7880
- Update ppo_main.py by @wawltor in #7881
- [LLM] revert benchmark codes by @RichardWooSJTU in #7871
- [LLM]support QWenVL second part by @DanGuge in #7808
- [Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
- 【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
- [Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
- 【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
- [Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
- [CE] Add Qwen into CE process by @ziangqin-baidu in #7887
- [Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
- [CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
- [LLM] ...
v2.7.2
This release fixes several minor issues.
What's Changed
- [Unified Checkpoint] fix checkpoint names by @DrownFish19 in #7794
- [Unified Checkpoint] Fix last checkpoint save by @DrownFish19 in #7810
- [PEFT] Cherry pick lora fix by @lugimzzz in #7826
- [Unified Checkpoint] Fix unified checkpoint by empty cache. by @ZHUI in #7855
- [Fix Download] update converted logic & fix hf hub download subfolder bug by @JunnYu in #7911
- [Cherry-pick] logger level by @KB-Ding in #7920
- [Cherry-pick] RuntimeTimer for the toolkit (#7913) by @KB-Ding in #7921
- [Release] 2.7.2 for paddlenlp bugfix. by @ZHUI in #7892
Full Changelog: v2.7.1...v2.7.2
v2.7.1
This release fixes several minor issues.
What's Changed
- Fixed several issues when resuming training by @ZHUI in #7771
- Fixed GPT initialization in pipeline mode by @DrownFish19 in #7775
- Fixed an issue with dist dataloader evaluation by @DesmonDay in #7778
Full Changelog: v2.7.0...v2.7.1