Releases · PaddlePaddle/PaddleNLP

21 May 04:16

gongel

rl-v1.0.0

3b0c8ff

Stable RL v1.0.0

GRPO、RF++ ready

Assets 2

12 Mar 08:19

ZHUI

v3.0.0-beta4

a286abc

v3.0.0-beta4 Pre-release

Pre-release

本次版本中，我们全面集成了 DeepSeek R1类的思考模型。推理团队深度优化了模型推理，速度业界领先。此外，我们还发布了自研PP-UIE信息抽取模型。本次重点更新如下。

重点更新：

模型新增
- DeepSeek V3/R1, R1-distill, QwQ-32B 热门思考模型，全面支持。用户可以点击官方模型文档列表查看、下载所有模型。
- 飞桨自研发布下一代通用信息抽取工具 PP-UIE 全新发布。支持8K长度信息抽取。使用文档。
推理部署
- 全面支持DeepSeek V3/R1满血版FP8、INT8、4比特量化推理，MTP投机解码。
  - FP8推理，单机输出超1000 tokens/s；4比特单机部署，输出超2100 tokens/s！
- 首次协同推理团队，发布统一推理部署镜像，热门模型一键部署。推理部署使用文档全面更新，体验全面提升！见文档。
模型训练：
- 新增大模型 Embedding 训练，支持INF-CL超大batch size训练。
- 新增MergeKit模型融合工具，缓解对齐代价。见文档。
- 低资源训练全面优化。16G小显存可以流畅训练。
其他重点特性：
- 文档页面，新增模型列表展示。用户可查看、下载对应模型文件。见文档。
- 训练新增 adam-mini 优化器。AdamW优化器支持 BF16 动量。

下面是一些对应的更新细节：

1. 模型、框架组件更新

模型新增
- 模型新增列表：
  - paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B
  - deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base，deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero,
  - deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - Qwen/Qwen2.5-7B-Instruct-1M，Qwen/Qwen2.5-14B-Instruct-1M, Qwen/QwQ-32B, Qwen/QwQ-32B-Preview
- PR #9738: Deepseek V3 模型新增。PR #9876: 增加 MTP 支持。PR #9797:修复 TP问题。 PR #9643: Deepseek llama3.3 新增模型说明（@DrownFish19）
- PR #9906: Deepseek V3 支持动态图直接加载 Float8 参数并进行推理 (@ZHUI)
- PR #9845: 新增PP-UIE系列模型 @Fantasy-02 i PR #9911 & PR #9913: PP-UIE 相关文档更新（@DrownFish19）
Tokenizer 改进
- PR #9548、PR #9577、PR #9594: “Hackathon No.43” 系列，完善 TokenizerFast 功能支持（@yinfan98）
- PR #9745: 修复 AutoTokenizer 问题（@DrownFish19）PR #9837: 保存额外的 special tokens（@DesmonDay）
Unified Checkpoint 相关:
- PR #9540: 修复加载master weight PR #9523: 修复缺失key问题。
- PR #9669: 统一检查点的 Bug 修复 PR #9935: 针对忽略 merge optimizer 时直接加载参数的问题进行修复
- PR #9741 & PR #9821: 修复专家并行支持问题
MergeKit 功能增强与优化
- 新增功能与优化
  - PR #9561: 新增 mergekit_with_sparsify 功能，支持稀疏化合并（@Mangodadada）。
  - PR #9702: 优化 MergeKit 的 GPU 支持，提升处理效率（@Mangodadada）。
  - PR #9811: 添加 LoRA（低秩适配器）合并功能，扩展模型融合能力（@lugimzzz）。
- 工具更新与维护
  - PR #9885: 对 MergeKit 工具进行代码更新与维护，优化整体逻辑。
- 日志与调试支持
  - PR #9948: 添加日志记录功能，增强调试与过程追踪能力（@lugimzzz）。
低资源特性优化
- PR #9804: 添加 use_fused_linear_cross_entropy 支持，减小显存。加入 pre_divided_factor 避免FP16溢出。
文档更新、其他：
- PR #9634: unified_checkpoint 文档更新
- PR #9734: 自定义设备代码重构（@ZHUI）
- PR #9715: 增加 offload_recompute_inputs（@will-jl944）
- PR #9800: 增加训练 token 计数功能（@lugimzzz）

2. LLM 训练更新

通用训练
- PR #9204: 更新 chatglmv2 的 tensor/pipeline 并行（@DrownFish19）
- PR #9827: 为 Qwen2Moe 和 Deepseek 增加 pipeline 与 flashmask 支持（@DrownFish19）
Embedding 训练
- PR #9508: Embedding trainer 新增（@DesmonDay）PR #9673: 增加 INF-CL 超大batch训练支持（@jie-z-0607）
- PR #9656: Trainer 中修复加载 rng 状态问题（@DesmonDay）
- PR #9721: 修复 embedding 随机性问题（@DesmonDay）
DPO训练
- PR #9543: LLM 模块中 dpo 对 qwen2 的 flashmask 支持（@wtmlon）
- PR #9620: 更新 dpo criterion（@lugimzzz）
- PR #9695: 支持 qwen 与 llama 的 dpo pp（@lugimzzz）
新功能和特性
- PR #9542: 增加 adam-mini 优化器支持（@lugimzzz）
- PR #9732: 支持BF16动量adamw 训练 (@lugimzzz)
- PR #9830: 修复非 flash 模式下 checkpoint 保存的问题（@SylarTiaNII）
- PR #9705: Cherry-Pick：在 optimizer step 前校验 loss（@SylarTiaNII）
- PR #9704: Cherry-Pick：为 LLM 训练增加异步 metrics dumper（@SylarTiaNII）
训练文档及问题修复
- PR #9689: 增加 KTO 功能（@lugimzzz）
- PR #9655: 更新 peft 文档（@lugimzzz）
- PR #9659: 修复 lora 相关问题（@lugimzzz）

3. Inference 更新

Predictor & Flask 更新
- PR #9831: 修复 multibatch 推理问题（@DrownFish19）
- PR #9841: 修复 position_ids 相关问题（@DrownFish19）
- PR #9864: 更新 Deepseek 推理（@DrownFish19）
- PR #9828: Flask 服务使 Inference 兼容 OpenAI API（@ZHUI）
MTP功能优化
- PR #9856: Inference 中支持 mtp 与 Deepseek-v3（@freeliuzc）
- PR #9894: 修复 Deepseek_v3 在多 GPU 模式下的 mtp 问题（@freeliuzc）
- PR #9936: 增加 mtp serving 支持（@freeliuzc）
部署优化
- PR #9872: 支持多机部署 LLM（@ltd0924）
- PR #9791: 合并 fastdeploy 部分代码（@kevincheng2）
Kernel优化
- PR #9707: 优化 gemm_dequant OP，利用 CUDA 核进行 int8_sq 运算（@zhink）
文档更新、测试
- PR #9613: Inference 模块支持 llama3.2 及文档更新（@yuanlehome）
- PR #9921: 修复 llama 的 block_size 设置（@zhaohaixu）
- PR #9711: 为 LLM predictor 增加 common models 和参数单元测试（@aooxin）

4. AutoParallel / 分布式训练更新

自动并行
- PR #9578: 增加 llama2-7b-cinn 的测试（@zhangbo9674）
基础配置与 CI 集成
- PR #9538: 增加 qwen model_auto 与 CI（@blacksheep-Aristotle）
- PR #9541: 增加 llama3.1 自动并行配置（@zhiqiu）
- PR #9551: 为 gpt 和 baichuan 自动 CI 加入支持（@blacksheep-Aristotle）
- PR #9591: 增加 gpt、baichuan 及 qwen 的 ce 支持（@blacksheep-Aristotle）
- PR #9412: 增加 single_model 网络和使用 intermediate API（@blacksheep-Aristotle）
- PR #9943: 通过 training_args 控制 split input（@blacksheep-Aristotle）
测试、验证与功能开关
- PR #9621: 增加 PIR recompute 测试（@waliwali777）
- PR #9647: 修改 loss_base 以支持 dropout 后 SPMD（@deepllz）
- PR #9714: 增加阶段 1 tensor fusion 相关开关（@AndSonder）
- PR #9672: 修复 recompute 测试在 to_static=1 下运行问题（@waliwali777）
- PR #9688: 自动并行下合并 ckpt 供推理使用（@xuxinyi389）
- PR #9750 & PR #9753: 修复 ernine auto trainer 相关 CI 错误（@blacksheep-Aristotle）
- PR #9749: 为 benchmark 开启 tensor fusion（@AndSonder）
- PR #9810: 增加 sharding tensor fusion save/load 开关（@AndSonder）
- PR #9862: 支持 deepseekv2 下的 DP/MP（@xuxinyi389）
- PR #9823: 增加 support ppo ckpt 功能（@xuxinyi389）

5. CI、文档、Benchmark 及测试脚本更新

CI 脚本及警告过滤
- PR #9547: 更新 CI 脚本（@Liujie0926）
- PR #9612: CI 中过滤 paddle.to_tensor 警告（@DrownFish19）
- PR #9626: 更新 a100 loss_base 配置（@Liujie0926）
- PR #9889: CI 脚本更新（@Liujie0926）
- PR #9524: LLM benchmark 中新增 qwen2.5-7b（@Liujie0926）
- PR #9662 & PR #9722: 更新 LLM_benchmark 脚本（@Liujie0926）
文档与说明改进
- PR #9585: 修复文档中失效链接（@DrownFish19）
- PR #9668: 更新 README.md（@ZHUI）
- PR #9785: 更新面向文档的 README（@ZHUI）
- PR #9746: 文档修复（@DrownFish19）
- PR #9725: 调整 benchmark 环境变量和模型配置（@XieYunshen）
- PR #9877: 修正 inference 和 servering 的文档（@ZHUI）
- PR #9834: 发布 DeepSeek 新闻及说明（@DrownFish19）
- PR #9922: 更正精调文档错误（@sijunhe）
Benchmark 配置与测试
- PR #9651: 修复 benchmark 多机任务异常退出的问题（@XieYunshen）
- PR #9891: 更新 gpt-13b 在 dygraph 模式下的最佳配置（@liym27）

6. NPU/XPU 及硬件相关更新

NPU 适配与修复
- PR #9499: 适配 NPU 用于 FusedHeadAndCrossEntropy（@tianhaodongbd）
- PR #9573: 修复 NPU 下的 where 问题（@tianhaodongbd）
- PR #9762: 适配新版 flash_attention_npu API（@will-jl944）
XPU 功能与优化
- PR #9549: qwen2 支持 flash_attn on XPU（@will-jl944）
- PR #9660: qwen2 支持 fused_rope（@will-jl944）
- PR #9789: 支持 XPU 下的 empty_cache（@will-jl944）
- PR #9796: 支持 XPU 用于自动并行 LLaMa（@From00）
- PR #9854: 为 deepseek 增加 XPU 下 fused op（@QingshuChen）

7. Bug 修复、性能优化及其他改进

状态加载与多线程问题
- PR #9464: 修复多线程下 load_state_dict 的问题（@DesmonDay）
各类模型与算子问题修复
- PR #9603: 修复 qwen2 modeling 中 d2s bug（@wawltor）
- PR #9569: 修复 dynamic 与 static 模式下的 norm outputs 问题（@Wangzheee）
- PR #9652: 修复 paddle.where 问题（@will-jl944）
- PR #9638: 增加 config replace_with_c_embedding（@Xing-lil）
- PR #9699: 修复 loraga amp 问题（@greycooker）
- PR #9752: 修复 get_block_shape_and_split_kv_block 的 bug（@lizhenyun01）
- PR #9759: 修复 speculate_verify_and_update op（@Wanglongzhi2001）
- PR #9674: 将 speculate_step 合并到 step op 中（@Wanglongzhi2001）
- PR #9757: Trainer 模块中更新 sequence parallel（@DesmonDay）
- PR #9765: 修复 loraga merge 问题（@greycooker）
- PR #9777: 分布式训练下 Cherry-Pick 支持 fuse optimizer（@SylarTiaNII）
- PR #9783: 修复 ce 错误（@blacksheep-Aristotle）
- PR #9779: 修复 pickle unsafe-load 问题（@DrownFish19）
- PR #9760: MoE 模块修复 expert parallel（@DesmonDay）
- PR #9790: 为 server infer 添加 pir_model 路径（@aooxin）
- PR #9706: Cherry-Pick 集成 PDC SDK 用于 LLM 训练容错（@SylarTiaNII）
- PR #9624: 添加 FLAGS 用于替换四个参数以便更好地加速（@zhink）
- PR #9806: 修复 LLAMA 参数解析 bug（@will-jl944）
- PR #9829: 更新 mixtral.md 文件（@yuanlehome）
- PR #9859: 修复 dsk rope 差异问题（@yuanlehome）

8. 环境/依赖及版本兼容更新

requirements 及安装更新
- PR #9514: 更新 py38 下的 requirements.txt （@ZHUI）
- PR #9118: 更新安装依赖（@DrownFish19）
- PR #9953: 针对 py38 增加 tokenizers 依赖（@DrownFish19）
Python 版本兼容性
- PR #9853: 解决类型注解在不同 Python 版本下的兼容性问题（@zty-king）

What's Changed

Update requirements.txt for py38 by @ZHUI in #9514
[Unified Checkpoint] fix single card loading without master weights by @DesmonDay in #9540
Fix multi-threading load_state_dict by @DesmonDay in #9464
delete generate_rank_mapping when export multi cards model by @yuanlehome in #9552
[LLM] dpo support qwen2 with flashmask by @wtmlon in #9543
[XPU] qwen2 supports flash_attn on XPU by @will-jl944 in #9549
[AutoParallel]: add qwen model_auto and ci by @blacksheep-Aristotle in #9538
add llama3.1 config for auto_parallel by @zhiqiu in #9541
Add more model support for speculate_decoding and refactor speculate_decoding by @Wanglongzhi2001 in #9504
[Intel_HPU]FSDPA custom kernel API update by @yanfeich in #9556
[Unified Checkpoint] fix load missing keys by @DesmonDay in #9523
【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 3 by @yinfan98 in https...

Contributors

QingshuChen, zhiqiu, and 62 other contributors

Assets 2

16 Dec 09:35

ZHUI

v3.0.0-beta3

418c3a5

v3.0.0-beta3 Latest

Latest

本次更新增强了PaddleNLP的基础体验，新增了Llama-3.2、DeepSeekV2模型，升级了TokenizerFast功能，重构了SFTTrainer。

此外，PaddleNLP还支持了优化器状态的卸载和重载功能，实现了精细化的重新计算，训练性能提升7%。在Unified Checkpoint方面，进一步优化了异步保存逻辑，新增Checkpoint压缩功能，可节省78.5%存储空间。
最后，在大模型推理、自动并行、多硬件支持、文档使用上，我们都进行了深度优化。

主要更新与增强

新增模型：
- 新增了Llama-3.2模型（#9199）、DeepSeekV2模型（#9250），进一步丰富了大型模型的选择。
基础架构改进：
- 重构了SFTTrainer和SFTConfig，提高了代码的可维护性。（#9318)
- 支持优化器状态的卸载和重载功能（#9467），有效降低了内存使用。
- 通过Hook实现了精细化的重新计算支持，例如，在llama模型上，训练性能可提升7%。（#9396）
- Unified Checkpoint优化：
  - 更新了异步保存逻辑（#9173, #9274, #9321），显著提升了检查点的保存与加载效率。
  - 增加了对专家并行的支持（#9055），使模型训练更加灵活。
  - 支持在开启sharding_comm_overlap时使用Unified Checkpoint。（#9392）
  - 新增了Checkpoint压缩功能，最多可节省78.5%的存储空间。（#9183）
  - 通过多线程技术减少了检查点的加载时间（#9034）。
- Tokenizer功能增强：
  - 允许在Tokenizer调用时指定padding_side参数（#9258），提升了用户体验。
  - Qwen tokenizer现支持添加特殊标记（#9344），增强了其灵活性。
  - 修复了TokenizerFast中缺失的clean_up_tokenization_spaces问题（#9304），提高了文本处理的准确性。
  - 统一了分词器的_pad函数到基类。#9280
  - 新增了对BertTokenizerFast的支持，并允许在调用时注册tokenizer。（#9353）
  - 改进了Qwen、Gemma、Yuan模型chat template的特殊输入处理。（#9462）
推理性能提升：
- 支持LLM推理直接量化内置bos模型（#9197）。
- 加强了对LLM推理中FP8 量化的支持（如#9328, #9423），满足了多样化的精度需求。
- 增强了投机解码（speculative decoding）和Append Attention 的支持。(#9180) (#9244)
硬件兼容性扩展：
- 加强了对Intel HPU的支持（#9273），现在支持动态图预测。
- 为XPU等国产硬件提供了统一检查点功能（#9312）。
- 修复了XPU和DCU支持中的错误，并提升了性能。#9414 和#9433
自动并行优化：
- 修复了自动并行过程中的多个问题（如#9217, #9355），确保了并行训练的稳定性。
- 更新了自动并行配置与检查点转换器（如#9136, #9432），提升了训练的灵活性和稳定性。
文档和测试更新：
- 更新了多个文档，包括LLM模型文档（如#9314）和量化文档（如#9330），确保了信息的时效性和准确性。
- 新增了多个测试用例，如分布式数据加载测试（#9438），提高了测试的覆盖率。
- 修复了文档中的链接错误和排版问题（如#9127, #9515），提升了用户体验。

本次更新标志着PaddleNLP的持续进步，为用户提供了更加全面、高效和稳定的NLP解决方案。我们期待在未来的版本中，继续为用户带来更多的创新和价值。

What's Changed

[Unified Checkpoint] update async_save_info in develop by @DesmonDay in #9173
add flashmask rm by @lugimzzz in #9154
[LLM_INFER] Support quantized model from bos and fix docs by @yuanlehome in #9197
fix ci not set no_proxy and modify tests in pir mode by @fightfat in #9205
[Models] Add Llama-3.2 by @DrownFish19 in #9199
move some auto_parallel args into class AutoTrainingArguments by @Wennie396 in #9155
[Performance] Compatible with flashmask API rename upgrade by @GuoxiaWang in #9019
[AutoParallel] add vpp align and pp amp test by @AndSonder in #9176
fix auto ci return bug when run in v100 by @fightfat in #9216
fix auto ci return bug when run in v100 by @AndSonder in #9228
[LLM] Add tools for parameters by @Hanyonggong in #9137
[AutoParallel] Add test for fuse_ffn and fuse_attention_qkv pass by @zhangbo9674 in #9203
[CI] Fix ci import. by @ZHUI in #9239
[Version] Update version info by @DrownFish19 in #9241
[Auto Parallel] Adding align mode support by @zhangyuqin1998 in #9150
[LLM INFER] top_p_sampling_reject support top_p=0 and custom seed by @gzy19990617 in #9202
[INFER] update tune_cublaslt_gemm op and fix some bugs by @yuanlehome in #9222
Reduce the time spent on git downloading third-party libraries by @vivienfanghuagood in #9246
[PIR] fix pir open bugs by @yuanlehome in #9248
Cherry-pick some PRs from incubate/paddlenlp-fleety by @sneaxiy in #9245
[Unified Checkpoint] Support expert parallel by @DesmonDay in #9055
[PIR] fix pir dt2st for chatglm_v2 by @yuanlehome in #9251
Cherry-pick some PRs from incubate/paddlenlp-fleety by @LiYuRio in #9253
[Unified Checkpoint] Fix generation config save by @DrownFish19 in #9223
[AutoParallel] Fix tests for pass paddle AutoParallel CI by @liym27 in #9267
change dataset by @lugimzzz in #9266
[Unified Checkpoint] update async save logic by @DesmonDay in #9274
add config file for model chatglm2,gemma,yuan by @Mangodadada in #9139
Fix async hang by @DesmonDay in #9276
[AutoParallel] Change llama test from sharding stage2 to stage1 by @zhangbo9674 in #9281
[Tokenizer] Enable padding_side as call time kwargs by @DrownFish19 in #9258
[Trainer] fix save_model by @DesmonDay in #9286
[CI] Skip inference test cases by @DrownFish19 in #9270
[LLM] Add deepseekv2 by @DrownFish19 in #9250
[Tokenizer] Unify tokenizer _pad by @DrownFish19 in #9280
[CI] Fix llm/alignment/rm/flashmask path by @DrownFish19 in #9289
support attention mask using causal=True by @GuoxiaWang in #9268
[FlashMask] Add FlashMask for Qwen2 by @DrownFish19 in #9264
bug fix for xpu_parallel_matmul by @FeixLiu in #9297
fix lora sharding v2 by @lugimzzz in #9300
[LLM INFER] Append attn by @yuanlehome in #9244
[Auto Parallel] fix bugs for split_batches_for_accumulation && fix bu… by @zhangyuqin1998 in #9217
[Tokenizer] Fix TokenizerFast missing clean_up_tokenization_spaces by @dynamicheart in #9304
clean llama static modeling file by @zhiqiu in #9301
[Unified Checkpoint] Accelerate loading checkpoint by multi-thread by @Crystal-X-111 in #9034
fix non-pipelinelayer to distributed by @gongel in #9310
change the legacy to slm by @wawltor in #9311
[TRL] Rename sft trainer. by @ZHUI in #9292
[XPU] support unified ckpt function by @cqulilujia in #9312
[LLM INFER] Fix some bugs and chatglm_v2 support block_attn by @yuanlehome in #9271
[Readme] Add flash mask by @lugimzzz in #9219
update llm infer docs by @yuanlehome in #9314
[Unified Checkpoint] Add split param and refactor code by @DesmonDay in #9240
[METAX] Support llama for MX C550 by @idontkonwher in #9186
update QR code by @DrownFish19 in #9325
add flash_attention on model chatglm_v2 by @Mangodadada in #9296
fix readme by @Mangodadada in #9326
[Unified Checkpoint] update non-merge checkpoint loading, move async_save_info.json location by @DesmonDay in #9321
[paddle cpu inference]fix cpu doc by @bukejiyu in #9299
[LLM INFER] add rope_theta for block_multihead_attention by @yuanlehome in #9334
Fix pr 9334 by @yuanlehome in #9335
fix parameter calculation in auto_parallel mode by @zhiqiu in #9327
[Docs] Update flashmask by @DrownFish19 in #9330
Update load_save_single_card.py by @DesmonDay in #9337
Update README.md by @DrownFish19 in #9339
[Tokenizer] Support reading Tiktoken tokenizer.model. by @lvdongyi in #9215
align default custom black/white list for dygraph and static graph by @zhiqiu in #9340
[intel_hpu] initial commit for intel_hpu support by @yanfeich in #9273
Compatible with Tensor.to change to out_of_place. by @DrownFish19 in https://github.co...

Contributors

co63oc, zhiqiu, and 54 other contributors

Assets 2

08 Oct 08:52

ZHUI

v3.0.0-beta2

81de41a

v3.0.0-beta2 Pre-release

Pre-release

本次更新强化了PaddleNLP的基础设施，新增了Qwen2.5、Mixtral 8*22B模型并升级了Tokenizer功能，同时重命名了数据索引工具。

此外，还修复了MoE模型参数保存与加载等问题，提升了文本处理准确性，并更新了文档与测试用例。在推理性能、硬件支持及自动并行方面也进行了优化，包括支持更多模型与参数配置、多GPU推理、国产硬件支持增强以及分布式训练流程优化等。

核心变更与增强功能

基础设施强化：
- 新增Qwen2.5模型（#9157 ），Mixtral 8*22B。进一步丰富模型库。
- Tokenizer功能升级，现支持加载额外解码标记added_tokens_decoder（#8997 ），提升灵活性。
- 数据索引工具tool_helpers重命名为fast_dataindex（#9134 ），以更直观反映其功能特性。
- 实现训练过程中数据间隔跳过的功能（#8989 ），优化数据处理效率。
- Unified Checkpoint优化：
  - 更新优化器异步保存信号（#8975 ），保证保存稳定。
  - 修复统一检查点中的多项问题（#9082 ），确保功能正确性。
问题修复：
- 解决了MoE模型参数保存与加载的问题（#9045 ）。
- 修正Tokenizer中空格与特殊符号处理的不足（#9010 , #9144 ），提升文本处理准确性。
文档与测试更新：
- 更新多个文档，涵盖LLM模型文档（如#8990 , #8999 ）及量化文档（#9057 ）等，确保信息的时效性与准确性。
- 新增测试用例，如针对PIR模式序列并行的测试（#9015 ），强化测试覆盖度。
- 修复文档中的链接错误（如#9127 ），提升用户体验。
其他关键变更：
- 推理性能优化：
  - LLM推理代码得到优化，支持更多模型与参数配置（如#8986 , #8995 ），拓宽应用场景。
  - 实现Qwen2_Moe多GPU推理（#9121 ）及wint4量化（#9129 ），提升推理效率。
  - 加强LLM推理对FP8与INT8的支持（如#9032 , #9151 ），满足多样化精度需求。
- 硬件支持拓展：
  - 增强对DCU、XPU、MLU等国产硬件的支持（如#8983 , #8504 , #9075 ），促进国产化替代。
  - 优化上述硬件上的模型训练与推理性能，提升整体运算效率。
- 自动并行优化：
  - 修复训练过程中数据重复跳过的问题（#8980 ），确保数据处理的正确性。
  - 更新自动并行配置与检查点转换器（如#8847 , #9136 ），提升并行训练的灵活性与稳定性。
  - 新增损失NaN/Inf检查器（#8943 ），及时发现并处理潜在数值问题。
  - 优化分布式训练中的数据加载与梯度合并流程（如#9120 , #9179 ），提升训练速度与稳定性。

What's Changed

[Unified checkpoint] update optimizer async save signal by @DesmonDay in #8975
更正run_dpo.py文件路径 by @Mangodadada in #8952
fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by @winter-wang in #8986
[Bug fix] fix skip consumed_samples twice bug by @zhangyuqin1998 in #8980
fix pip error in legacy benchmarks by @fightfat in #8978
【auto_parallel】Add checkpoint convertor by @xingmingyyj in #8847
[llm]update finetune.md by @lugimzzz in #8990
tool_helpers升级后可以支持32766个数据集. by @JunnYu in #8994
add DCU inference docs by @YanhuiDua in #8983
[Distributed]Add loss nan/inf checker by @ForFishes in #8943
【llm】update docs by @lugimzzz in #8999
[Feature] Fused Mixtral support by @penPenf28 in #8901
[XPU] Add README.md for llama2-7b by @xiguapipi in #8979
Add gcu llama readme by @EnflameGCU in #8950
fix qwen model use_casual_mask by @deepllz in #9009
[ZeroPadding] revert zero_padding #8973 by @DrownFish19 in #9003
[LLM Inference] Fix step.cu bug by @yuanlehome in #8995
Refine checkpoint converter by @zhangbo9674 in #9001
[Feature] fused mixtral wint4 by @penPenf28 in #9013
llm inference docs by @Sunny-bot1 in #8976
[LLM Inference] Support Qwen2_Moe Inference Model by @CJ77Qi in #8892
fix llama3 static run by @yuanlehome in #8849
[paddle inference cpu]update cpu inference by @bukejiyu in #8984
fix the tipc ce case by @wawltor in #8748
[Cherry-pick] Add is_distributed field in sharding reshard param_meta by @sneaxiy in #9028
[Tokenizer] Support for loading added_tokens_decoder by @DrownFish19 in #8997
[Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by @lixcli in #9032
Fix checker of nan/inf by @ForFishes in #9029
[Cherry-pick] add comm buffer size (#8963) by @ForFishes in #9031
[Unified Checkpoint] Update async save info by @DesmonDay in #8982
[llm]support pad to max_length & fix sp bug by @lugimzzz in #9040
[Bugfix] fix bias optional by @penPenf28 in #9037
fix setup.py for llm inference by @yuanlehome in #9041
[Inference] Add cutlass gemm dequant op by @gzy19990617 in #8909
[Inference] update fakequant support by @lixcli in #9047
add test for pir sequence parallel on llama model by @liym27 in #9015
Fix moe save load by @Meiyim in #9045
Update quantization.md by @ZHUI in #9057
【Fix】Initialize dp degree in single GPU by @greycooker in #9056
fix bos download by @westfish in #9023
[Inference] Update fakequant script by @lixcli in #9054
[AutoParallel][PIR] Fit pir grad merge by @AndSonder in #8985
[MLU] Support rms_norm_mlu by @PeiyuLau in #8504
[Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by @ckl117 in #8953
[Inference] Qwen2 support fp8 inference by @ckl117 in #8954
[Version] update version info by @DrownFish19 in #9060
[NPU] Fix baichuan2-13b-chat infer by @ronny1996 in #9070
[MLU] Fix Llama attrntion_mask in npu and mlu by @DrownFish19 in #9075
Fix the memory overflow bug of the tune_cublaslt_gemm operator by @Hanyonggong in #9076
[Inference] Fix weight_only_int4 bug by @lixcli in #9073
[Auto Parallel] fix data stream bug of dist.to_static by @zhangyuqin1998 in #9077
fix hang when Flag_dataloader_use_file_descriptor=True by @deepllz in #9080
fix llm predict install error by @fightfat in #9088
[PIR] add pir grad merge test by @AndSonder in #9074
Update readme by @EnflameGCU in #9046
[LLM] Add tensor parallel for chatglmv2 by @SevenSamon in #9014
[data] update tool_helpers version and add unittest by @JunnYu in #9093
fix baseline because of PR#8769 by @fightfat in #9092
fix use paddle.incubate.jit.inference(model) errors by @chang-wenbin in #9016
[CI] Fix paddlepaddle install by @DesmonDay in #9102
[LLM] fix train on npu by @SylarTiaNII in #9101
Disable ut by @zhangbo9674 in #9108
[AutoParallel] Enable CI for gradclip by @JZ-LIANG in #9059
[Inference] Remove ceval from run_finetune by @lixcli in #9100
[Bugfix] fix multi-gpu infer by @penPenf28 in #9107
【Inference】fix step kernel by @gzy19990617 in #9122
[DCU] fix DCU w8a8c8 GEMM shape by @YanhuiDua in #9115
[Inference] FP8 gemm auto-tune by @ckl117 in #9094
Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by @zhangbo9674 in #9120
[LLM Inference] Support Qwen2_Moe Inference with MultiGPU by @CJ77Qi in #9121
[Unified Checkpoint] Fix uc lora config, fix release_grads by @DesmonDay in #9082
[Inference]qwen2-a8w8c8 support use_fake_parameter by @ckl117 in #9109
Add fast_ln spmd rules by @From00 in #9125
fix pir dtype by @wanghuancoder in #9130
Remove ring_flash_attention warning by @DrownFish19 in #9119
[DOC] Fix LLM page 404 Not Found by @DrRyanHuang in #9127
Add hardware flops for pretraining by @ZHUI in #9069
[Benchmark] Fix amp level bug in some gpt tests by @zhangbo9674 in #9116
[Auto Parallel] Fix ckpt_converter for auto_parallel by...

Contributors

tizhou86, Meiyim, and 44 other contributors

Assets 2

22 Aug 03:41

ZHUI

v3.0.0-beta1

7473743

v3.0.0-beta1 Pre-release

Pre-release

PaddleNLP从v3.0.0-beta0升级至v3.0.0-beta1版本，带来了多项重要更新与增强。新引入了Yuan、mamba和jamba模型，并优化了LLM推理代码，提升了兼容性和效率。

基础性能优化方面，添加了快速分词器，实现了MoE优化器参数广播，加速了层归一化。同时，修复了多个bug，包括safetensors shape切片问题和Windows下mmap问题，提升了系统稳定性和兼容性。

文档与测试方面，进行了全面更新和优化，确保了文档的准确性和代码的可读性。此外，还增强了国产硬件支持，包括DCU和XPU的优化，以及PIR模式和自动并行的配置更新。

主要变更与新增功能

1. 新模型与特性引入

新模型：在#8654 中引入了Yuan模型；在#8513 和#8517 中分别添加了mamba和jamba新模型，并在后续Pull Request中修复了相关bug，确保了模型的稳定运行。
LLM推理优化：通过多个Pull Request，我们优化了LLM推理代码，并新增了对新模型和参数的支持，进一步提升了推理效率和兼容性。

2. 基础性能优化

快速分词器：在#8832 中，我们添加了基于tokenizers库的快速分词器，显著提升了分词速度和性能。
MoE优化：在#8810 中，我们实现了MoE（Mixture of Experts）优化器参数的广播，有效增强了模型训练的效率。
层归一化加速：通过多个Pull Request，我们添加了fast_rmsnorm，启用了use_fast_layer_norm，并更新了基准测试配置，进一步加速了模型训练过程。特别是在#8717 中，我们支持了在微调过程中使用use_fast_layer_norm，为用户提供了更多灵活性。
训练性能优化：在#8803 中，我们添加了enable_sp_async_reduce_scatter选项，有效优化了训练性能。
字典参数支持：在#8446 中，我们为trainer的argparser添加了支持字典参数的新特性，增强了参数传递的灵活性。同时，在#8904 中，我们更新了tensorboard的要求，确保了与最新版本的兼容性。

3. Bug修复

safetensors修复：在#8702 中，我们修复了safetensors的形状问题。
Windows系统mmap修复：在#8734 中修复了mmap问题，提升了windows的兼容性。
其他Bug修复：包括#8687 、#8730 等多个Pull Request中的bug修复。

4. 文档与测试更新

文档优化：在多个Pull Request中，我们进行了文档更新、代码风格清理和版本信息更新，确保了文档的准确性和可读性。
README修复与增强：在#8741 中，我们修复了README中的断链问题；同时，多个贡献者更新了README文档，添加了新的测试用例，确保了文档与代码的同步更新。

5. 其他重要变更

国产硬件支持增强

DCU支持：在#8580 中，我们实现了针对DCU的高性能LLM训练和推理，拓展了PaddleNLP的硬件支持范围。
XPU优化：在#8527 中，我们为XPU添加了LoRA优化；在#8697 和#8710 中，我们分别实现了XPU的allgather功能和修复了统一检查点的gather问题，进一步提升了XPU上的模型训练效率。

PIR模式支持

导出与加载优化：在#8689 中，我们修改了PIR模式下llama模型的导出方式；在#8712 和#8766 中，我们支持了以三种模式（旧IR、PIR模型文件、PIR JSON文件）加载或保存Llama2-7b模型，为用户提供了更多灵活性和兼容性。

自动并行优化

配置更新：在#8679 中，我们更改了Llama2-7b配置中的max_steps以适应自动并行；在#8767 和#8828 中，我们优化了自动训练器的保存和加载功能；在#8750 中，我们更新了全局剪切的损失函数，进一步提升了自动并行的效率和准确性。

What's Changed

[DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
bug fix by @wtmlon in #8687
[XPU] add lora optimization by @dynamicheart in #8527
[pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
[AutoParallel]Change max_steps in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
[benchmark] Change the mirror source for pip by @mmglove in #8699
update loss base of auto-parallel tests by @zhiqiu in #8701
Add new mistral by @wtmlon in #7425
[Safetensors] Fix safetensors shape by @DesmonDay in #8702
[BUG] num_samples 向下去整, 防止prefrech预取时候超过数据集最大长度... by @JunnYu in #8690
xpu use allgather by @FeixLiu in #8697
add fast_rmsnorm by @deepllz in #8680
enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
fix xpu gather for unified ckpt by @FeixLiu in #8710
[inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
fix fast_ln backward by @deepllz in #8719
finetune support use_fast_layer_norm by @tianhaodongbd in #8717
bug fix by @FeixLiu in #8730
disable lora by @lugimzzz in #8674
[Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
correct broken links in readme by @jzhang533 in #8741
revert benchmark fix by @ronny1996 in #8747
[LLM] Add Yuan model by @zhaogf01 in #8654
fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
[LLM] Update sequence parallel linear import by @DrownFish19 in #8706
[Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
update a100 loss by @zhiqiu in #8708
[PaddleNLP 3.0] Update README by @DrownFish19 in #8681
[AutoParallel] update loss for global clip by @JZ-LIANG in #8750
[NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
[DEV] Update develop version show by @DrownFish19 in #8754
[inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
add benchmark baichuan2 scripts by @fightfat in #8683
Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
fix the ce for the unittest by @wawltor in #8772
Enable parallel_config to use commas as delimiters. by @Difers in #8677
fix incorrect token counting in llm/predictor.py by @lszxb in #8769
Refine savable by @ZHUI in #8758
[CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
[XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
fix version show by @DrownFish19 in #8791
[BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
vera-pissa method added by @TranscenderNing in #8722
update version by @DrownFish19 in #8792
[Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
[DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
[Prediction] Update LLM prediction. by @DesmonDay in #8778
[Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
[AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
[MoE] Optimizer parameter broadcast by @DesmonDay in #8810
[Doc] Update README by @DrownFish19 in #8817
support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
add paddle nv-embed-v1 by @Li-Z-Q in #8785
fix pad_token_id bug by @yuanlehome in #8814
[DCU] fix llama inference bug on DCU by @Deleter-D in #8815
[Doc] Add LLaMA3.1 by @DrownFish19 in #8824
[BUG] Fix build train valid test datasets by @JunnYu in #8826
Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
[AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
[Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
[Trainer] update clear_grad by @DesmonDay in #8829
[Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
[Inference LLM] support static c8 by @yuanlehome in #8833
support sft mapdataset by @greycooker in #8840
Cherry pick some changes from incubate branch by @sneaxiy in #8862
support nested list of dict inputs by @deepllz in #8876
Fix the bug with issues code 8641. by @smallbenxiong in #8880
Fix the issue of P-tuning official sample error by @guangyunms in #8884
modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
[llm]fix zeropadding by @lugimzzz in #8895
修复fast_ln算子动半开启后报错 by @Wennie396 in #8891
enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
Update run_pretrain.py by @...

Contributors

jzhang533, zhiqiu, and 41 other contributors

Assets 2

28 Jun 03:05

DrownFish19

v3.0.0-beta0

a2b8a78

v3.0.0-beta0

很高兴地通知大家，飞桨大模型套件发布v3.0.0beat版本：拥抱大模型，体验全升级。具体工作如下：

统一大模型工具链，实现国产计算芯片全流程接入；
全面支持飞桨4D并行配置、高效精调策略、高效对齐算法、高性能推理等大模型产业级应用流程；
自研极致收敛的RsLoRA+算法、自动扩缩容存储机制Unified Checkpoint和通用化支持FastFFN、FusedQKV助力大模型训推；
主流模型持续支持更新，提供高效解决方案。

大模型精调对齐训推优化

PEFT：
- 新增scaling策略，支持rslora, pissa算法 in #8256
- 适配FusedQKV和FastFFN参数 in #8372 #8526
DPO：
- 支持DPO（llama，qwen）in #8474
- 支持序列并行 in #7953
国产芯片支持：
- 适配NPU in #8303 #8342 #8359 #8399 #8409 #8401 #8431 #8439 #8438 #8442 #8528 #8642
- 适配XPU in #8282 #8505 #8515 #8588 #8595 #8598
- 适配GCU in #8445 #8470
性能优化：
- 优化Unified Checkpoint机制 in #8204 #8409 #8422 #8512
- 模型并行优化 in #8370
- 序列并行优化 in #8551
- 支持llama3 (wint8|4/a8w8) in #8630
其他
- 新增模型内存监控 in #8269

模型新增

新增Gemma模型 in #8082
- google/gemma-7b
- google/gemma-7b-it
- google/gemma-2b
- google/gemma-2b-it
新增llama3模型 in #8307 #8371
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
新增Qwen2模型 in #8338 #8584 #8601
- Qwen/Qwen1.5-0.5B
- Qwen/Qwen1.5-0.5B-Chat
- Qwen/Qwen1.5-1.8B
- Qwen/Qwen1.5-1.8B-Chat
- Qwen/Qwen1.5-4B
- Qwen/Qwen1.5-4B-Chat
- Qwen/Qwen1.5-7B
- Qwen/Qwen1.5-7B-Chat
- Qwen/Qwen1.5-14B
- Qwen/Qwen1.5-14B-Chat
- Qwen/Qwen1.5-32B
- Qwen/Qwen1.5-32B-Chat
- Qwen/Qwen1.5-72B
- Qwen/Qwen1.5-72B-Chat
- Qwen/Qwen1.5-110B
- Qwen/Qwen1.5-110B-Chat
- Qwen/Qwen1.5-MoE-A2.7B
- Qwen/Qwen1.5-MoE-A2.7B-Chat
- Qwen/Qwen2-0.5B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2-72B
- Qwen/Qwen2-72B-Instruct
- Qwen/Qwen2-57B-A14B
- Qwen/Qwen2-57B-A14B-Instruct

基础框架升级

功能优化：
- 支持FusedQKV和FastFFN权重自动融合分割 in #8202 #8378 #8432
- 支持模型并行参数同步设置 in #8311
- 支持RoPE算子设定theta in #8440
- 通信overlap优化 in #8276 #8473 #8499 #8594
AutoParallel优化
- llama支持recompute机制 in #8265
- 适配llama3 in #8395
- position_ids优化 in #8363
- 支持流水线并行split_backward in #8479
- 适配qwen in #8312
分布式能力优化：
- 修复流水线并行中enable_sharding_comm_overlap中参数错误问题 in #8333
- MoE并行支持 in #8498 #8522
chat能力优化：
- 增加Chat template in #8226
其他
- 文档 in #8336 #8393
- 更新nested操作 in #8380
- 随机性更新 in #8450 #8396
- 算子更新 in #8472
- example更新 in #8538

问题修复

修复sharding数量小于100的bug in #8146
修复TP/PP参数合并问题 in #8239
修复tensor.shape与paddle.shape(tensor)不一致问题 in #8260
修复fp16+delay_scale_loss_scale+sharding_stage1_overlap的bug in #8314
增加pipelines运行文档及提示 in #8292 #8308 #8202 #8353
修复text feature extraction任务中tokenizer输入 in #8331
修复import error in #8332 #8367

结构调整

PaddleNLP文件结构调整 in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666

What's Changed

[dist]pip requirements-dev.txt by @Liujie0926 in #8258
add scaling by @lugimzzz in #8256
[LLM]Support Gemma model by @Southpika in #8082
[BugFix] Try except sequence parallel utils by @DesmonDay in #8189
Update CodeCov GitHub Action by @sijunhe in #8268
[AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
Fix sharding < 100 limitation bug by @sneaxiy in #8146
use tensor.shape bug not paddle.shape(tensor) by @wanghuancoder in #8260
[dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
[Bug Fix]Fix merge parameters in pp by @Southpika in #8239
[LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
add a100 test ground truth by @zhiqiu in #8249
[paddle-pipelines] faq semantic search question answering reamde by @w5688414 in #8292
[paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
Support llama-3 by @ZHUI in #8307
[Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
[paddle-pipelines] Update mkdocs by @w5688414 in #8310
[benchmark]update llama2_ips by @Liujie0926 in #8322
[dist CI]fix before_hook by @Liujie0926 in #8283
benchmark llama worker=1 by @wanghuancoder in #8305
【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
Add system env log for llama test by @zhangbo9674 in #8321
[LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
[Distributed] fix lora by @SylarTiaNII in #8325
fix try import by @w5688414 in https://github.com/PaddlePaddle/Pa...

Contributors

zhiqiu, jeff41404, and 49 other contributors

Assets 2

20 Jun 07:42

ZHUI

v2.8.1

db99efd

v2.8.1

What's Changed

[Trainer] Fix sharding overlap bug by @DesmonDay in #8334
[Cherry-pick] update truncate by @KB-Ding in #8375
[BugFix] Fix llama3 eot_id. by @ZHUI in #8373
[Trainer] update distributed dataloader by @DesmonDay in #8426
[BugFix] Fix load rng compatibility. by @ZHUI in #8451
Cherry pick/fast_safe_open by @ZHUI in #8458
【cherry pick】adapter new type promotion rule for Paddle 2.6 by @zxcd in #8463
Quick fix from pretrained. by @ZHUI in #8487
Release/2.8 by @Galaxy1458 in #8437
Fix from_pretrained os.path.split by @DesmonDay in #8508
[fea] Cherry-picked MOE updates from develop by @bo-ke in #8531
[LLM] relocate tensor_parallel_output to avoid conflict (#8419) by @DesmonDay in #8533
Update sequence_parallel for predict by @DesmonDay in #8547
Cp/fix by @ZHUI in #8569
Do not save moe_group by @DesmonDay in #8570
[Release] 2.8.1 by @ZHUI in #8636

Full Changelog: v2.8.0...v2.8.1

Contributors

zxcd, ZHUI, and 4 other contributors

Assets 2

24 Apr 10:04

w5688414

v2.8.0

3105c18

v2.8.0

很高兴地通知大家，飞桨大模型套件发布v2.8.0版本。这个版本中，我们深度优化套件的大模型精调对齐的能力，提升大模型套件在国产计算硬件训推能力，具体工作如下：

特色精调和高效对齐：提供自研极致收敛的RsLoRA+算法，大幅提升PEFT训练收敛速度以及训练效果；引入高性能生成加速到RLHF PPO算法，打破 PPO 训练中生成速度瓶颈，PPO训练性能大幅领先。
大模型训练提速：通用化支持 FastFNN、FusedQKV等多个大模型训练性能优化方式，大模型训练更快、更稳定。

大模型精调对齐训推优化

精调
- PEFT
  - 新增QLoRA pipeline parallel支持 #7801
  - 自定义python算子，优化LoRA的前反向计算 #8106
  - 新增 rslora，lora+，pissa 算法 #8111
- 长序列
  - 新增长序列方案和模型解耦。RotaryEmbedding，LinearScalingRotaryEmbedding，NTKScalingRotaryEmbedding，DynamicNTKScalingRotaryEmbedding等。#8076
- Alignment
  - 新增PPO 对齐算法 #7305
- 训练策略
  - 新增LLaMA sequence parallel #7746
  - 新增LLaMa master_grad #7658
  - GPT新增auto_parallel的支持。 #8160
- 新增算子
  - 新增GQA 算子支持 #7906
  - 新增gqa fuse attention qkv #7890
  - 新增SwiGLU 算子 #8038
推理
- 新增QWenVL 的静态图推理 #7808
  模型新增
新增Deberta，Debertav2模型 #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
新增mixtral-of-experts #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
新增LLama3 #8315
- meta-llama/Meta-llama-3-8b
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-llama-3-70b
- meta-llama/Meta-Llama-3-70B-Instruct

基础框架升级

Trainer升级
- Trainer新增 ignore_save_lr_and_optim 参数，可以忽略保存lr scheduler以及optimizer权重 #7978
- Trainer新增 Wandb 和 Tensorboard 支持。#7863
- Trainer支持同时解析命令行与json文件参数 #7768
- trainer新增gradient_sync_after_accumulate支持。#8045
- dataloader新增cuda编译检查 #8099
AutoParallel升级
- llama 自动并行支持bf16损失 #7874
- 增加refined-recompute机制#7349
- 在AMP-O2策略下支持master_grad#7658
- 进一步完善动静统一自动并行分布式训练基本功能#7985 #8114
- 新增Llama2模型基于AutoTrainer的半自动训练 #7851 #7885
- 新增llama的hybrid_parallel_topo_order策略。#8011
- llama模型组网动静统一 #8127
其他
- 重构download下载逻辑，支持从bos、hf hub、aistudio、model scope下载模型 #7608 #8020 #8088
- 新增分布式训练的pipeline parallel #8051
- 适配npu的FA #8171 #8210
- llama新增block_attention/cachekv quant #7649

其他支持

新增俄罗斯套娃（matryoshka representation learning）检索策略，节省计算和存储资源。#8165

问题修复

日志级别修改，并增加timelog计时日志，兼容不同设备。#8261
修复pipeline并行中随机初始化的shared weights不一致的问题，覆盖GPT/OPT等模型。#7772
关闭CI及单测中从huggingface hub下载的逻辑 #7798 #8198
修复llm的gradio开启chat template时候重复拼接query 和 history的问题。#7992
修复GPT模型下载key error问题。#8253
修复LlamaRotaryEmbedding #7882
修复allreduce dtype的问题 #7876
修复框架侧dev分支清理 paddle.jit.dy2static.utils_helperAPI的问题 #7989
修复read-data timer在ignore_data_skip=False and skip_profile_timer=False 的问题。#8177
修复Wandb单测问题 #8066 #8056
修复Trainer同时解析json与命令行列表参数报错问题#7860
修复Gradio UI 中的推理问题 #7740 #7788
修复 Tokenizer 相关的基础问题 #7797 7870
修复 custom devices上loading rng state的问题。#7894
修复自动并行打印BF16的loss编码错乱的问题#7874
采用float初始化模型，修复静态图自动并行AMP报错问题#8033#8199
修复ShardDataloader接口在PipeLine Parallelism下使用错误问题#8014
修复llama在custom devices的精度问题。#7895
修复NPU AICPU算子问题 #7976
修复FusedLinearWithGradAdd少传参数的问题。#8178

What's Changed

[Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
[AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
Add codecov check by @zjjlivein in #7760
[CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
[DOC] Update trainer.md by @ZHUI in #7761
[Release] Change version to 2.7.0 by @ZHUI in #7764
[benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
[Release] Update release.yml to release tags by @ZHUI in #7765
[AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
[New Features] support dynamic src_length by @wj-Mcat in #7740
Fix unified_checkpoint bug by @DrownFish19 in #7770
[DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
[Trainer] Fix dist dataloader eval by @DesmonDay in #7777
[Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
[PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
[Paddle-Pipelines] update faiss by @qingzhong1 in #7793
Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
[tests] download slow by @JunnYu in #7798
[INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
Add CE for Distributed Hybrid Parallel by @iosmers in #7782
add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
[Pretrain] Fix eval during pretrain by @DesmonDay in #7806
pipeline parallel benchmark by @zhangting2020 in #7759
[Bug fixes] fix br gradio by @wj-Mcat in #7788
delete useless code for write_cache_kv.cu by @yuanlehome in #7812
[llm]support qlora pp by @lugimzzz in #7801
Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
[LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
[Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
[llm]fix lora by @lugimzzz in #7824
fused rms spmd by @liuzhenhai93 in #7830
[Pretrain] Fix eval during pretrain by @DesmonDay in #7827
[neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
[neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
[Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
[semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
[faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
[text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
[LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
Support 5.2 bloom by @zhoutianzi666 in #7846
[unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
[unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
[New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
[Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
[CI] add ci approval pipelines by @zjjlivein in #7859
[fix] fix a bug of trainer/argparser.py by @greycooker in #7860
[Improvement] fix ops improting in utils by @wj-Mcat in #7865
[Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
[Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
Add PPO training. by @guoshengCS in #7305
Update reward_main.py by @wawltor in #7880
Update ppo_main.py by @wawltor in #7881
[LLM] revert benchmark codes by @RichardWooSJTU in #7871
[LLM]support QWenVL second part by @DanGuge in #7808
[Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
[Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
[Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
[CE] Add Qwen into CE process by @ziangqin-baidu in #7887
[Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
[CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
[LLM] ...

Contributors

co63oc, zhiqiu, and 54 other contributors

Assets 2

30 Jan 07:50

ZHUI

v2.7.2

b39e701

v2.7.2

本版本做了一些小问题的修复

What's Changed

[Unified Checkpoint] fix checkpoint names by @DrownFish19 in #7794
[Unified Checkpoint] Fix last checkpoint save by @DrownFish19 in #7810
[PEFT] Cherry pick lora fix by @lugimzzz in #7826
[Unified Checkpoint] Fix unified checkpoint by empty cache. by @ZHUI in #7855
[Fix Download] update converted logic & fix hf hub download subfolder bug by @JunnYu in #7911
[Cherry-pick] logger level by @KB-Ding in #7920
[Cherry-pick] RuntimeTimer for the toolkit (#7913) by @KB-Ding in #7921
[Release] 2.7.2 for paddlenlp bugfix. by @ZHUI in #7892

Full Changelog: v2.7.1...v2.7.2

Contributors

DrownFish19, ZHUI, and 3 other contributors

Assets 2

04 Jan 14:24

ZHUI

v2.7.1

bb9062e

v2.7.1

本版本做了一些小问题的修复

What's Changed

修复了训练恢复遇到的一些问题 @ZHUI in #7771
修复了GPT在Pipeline模式下的初始化问题 @DrownFish19 in #7775
修复了dist dataloader评估时的问题。 @DesmonDay in #7778

Full Changelog: v2.7.0...v2.7.1

Contributors

DrownFish19, ZHUI, and DesmonDay

Assets 2

Releases: PaddlePaddle/PaddleNLP

Stable RL v1.0.0

Uh oh!

v3.0.0-beta4

重点更新：

模型新增

推理部署

模型训练：

其他重点特性：

1. 模型、框架组件更新

2. LLM 训练更新

3. Inference 更新

4. AutoParallel / 分布式训练更新

5. CI、文档、Benchmark 及测试脚本更新

6. NPU/XPU 及硬件相关更新

7. Bug 修复、性能优化及其他改进

8. 环境/依赖及版本兼容更新

What's Changed

Contributors

Uh oh!

v3.0.0-beta3

主要更新与增强

What's Changed

Contributors

Uh oh!

v3.0.0-beta2

核心变更与增强功能

What's Changed

Contributors

Uh oh!

v3.0.0-beta1

主要变更与新增功能

1. 新模型与特性引入

2. 基础性能优化

3. Bug修复

4. 文档与测试更新

5. 其他重要变更

国产硬件支持增强

PIR模式支持

自动并行优化

What's Changed

Contributors

Uh oh!

v3.0.0-beta0

大模型精调对齐训推优化

模型新增

基础框架升级

问题修复

结构调整

What's Changed

Contributors

Uh oh!

v2.8.1

What's Changed

Contributors

Uh oh!

v2.8.0

What's Changed

Contributors

Uh oh!

v2.7.2

What's Changed

Contributors

Uh oh!

v2.7.1

What's Changed

Contributors

Uh oh!