PaddlePaddle
diff --git a/docs/llm/peft.md b/docs/llm/peft.md
Lines changed: 1 addition & 1 deletion
diff --git a/llm/llama/benchmark.py b/legacy/examples/benchmark/llm/llama_single_gpu/benchmark.py
rename from llm/llama/benchmark.py
rename to legacy/examples/benchmark/llm/llama_single_gpu/benchmark.py
diff --git a/llm/llama/benchmark_utils.py b/legacy/examples/benchmark/llm/llama_single_gpu/benchmark_utils.py
rename from llm/llama/benchmark_utils.py
rename to legacy/examples/benchmark/llm/llama_single_gpu/benchmark_utils.py
diff --git a/llm/.gitignore b/llm/.gitignore
Lines changed: 0 additions & 12 deletions
diff --git a/llm/Alignment/RM/models b/llm/Alignment/RM/models
Lines changed: 0 additions & 1 deletion
diff --git a/llm/README.md b/llm/README.md
Lines changed: 40 additions & 44 deletions
@@ -277,4 +277,4 @@ key function
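         This function iterates over the full list of weight parameters and, for each weight, counts the parameters that receive gradient updates, then prints the summary.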
 ```
 
-For more detailed usage, see the [finetuning script](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/finetune_generation.py) and how the corresponding launch scripts are written (described in [README.md](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/README.md)).
+For more detailed usage, see the [finetuning script](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/run_finetune.py) and how the corresponding launch scripts are written (described in [README.md](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/README.md)).
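
The parameter summary described in this hunk can be illustrated with a minimal, framework-free sketch. The `Param` class, its `trainable` flag, and both helper functions are hypothetical stand-ins for illustration only; they are not the PaddleNLP API, which works on real model parameters.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Param:
    """Hypothetical stand-in for a framework weight parameter."""
    name: str
    shape: tuple
    trainable: bool  # True if the parameter receives gradient updates


def summarize_trainable(params):
    """Return (trainable element count, total element count)."""
    total = sum(prod(p.shape) for p in params)
    trainable = sum(prod(p.shape) for p in params if p.trainable)
    return trainable, total


def format_summary(params):
    """Build the one-line summary that would be printed."""
    trainable, total = summarize_trainable(params)
    ratio = 100.0 * trainable / total if total else 0.0
    return f"trainable params: {trainable} || all params: {total} || trainable%: {ratio:.4f}"


# A frozen base weight plus two small LoRA-style adapter matrices.
params = [
    Param("linear.weight", (4096, 4096), False),
    Param("linear.lora_A", (4096, 8), True),
    Param("linear.lora_B", (8, 4096), True),
]
print(format_summary(params))
```

With a frozen base weight and two small adapter matrices, only the adapter elements are counted as trainable, which is the kind of ratio such a summary is meant to surface.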
@@ -19,17 +19,17 @@
 
 ##  🛠️ Supported Models 🛠️
 
-| Model | Pretrain | SFT | LoRA | Prefix Tuning |  Quantization | Weight convert |
-| --- | --- | --- | --- | --- | --- |  --- |
-| [LLaMA/LLaMA2](./llama) | ✅  | ✅ | ✅ | ✅ | ✅  | ✅  |
-| [Baichuan/Baichuan2](./llama) | ✅  | ✅ | ✅ | ✅ | ✅  | ✅  |
-| [ChatGLM-6B](./chatglm) |  ❌  |  ✅  |    ✅  |  ✅  |  ✅  | ❌  |
-| [ChatGLM2/ChatGLM3](./chatglm2) |  ❌  |    ✅  |  ✅  |  ✅  |  ✅  | ✅  |
-| [Qwen](./qwen) | ✅ | ✅ | ✅ | ✅ |  🚧 | ✅  |j
-| [Bloom](./bloom) | ❌  | ✅ | ✅ |  ✅ | ✅ | ✅  |
-| [GPT-3](./gpt-3) |   ✅  |  ✅  |    🚧  | 🚧  | 🚧 | ✅  |
-| [OPT](./opt) | 🚧 | ✅ | ✅ | 🚧 |  🚧 | ✅  |
-| [GLM](./glm) | ❌  | ✅ | ✅ | 🚧 |   🚧 | ✅  |
+| Model | Pretrain | SFT | LoRA | Prefix Tuning |  DPO |  Quantization | Weight convert |
+| --- | --- | --- | --- | --- | --- |  --- | --- |
+| [LLaMA](./llama) | ✅  | ✅ | ✅ | ✅ | ✅  | ✅  | ✅  |
+| [Qwen](./qwen) | ✅ | ✅ | ✅ | ✅ | ✅  | 🚧 | ✅  |
+| [Mixtral](./mixtral) | ✅  | ✅ | ✅ | ❌  |  🚧 |🚧 | 🚧  |
+| [Baichuan/Baichuan2](./llama) | ✅  | ✅ | ✅ | ✅ | ✅  | ✅  |  ✅  |
+| [ChatGLM-6B](./chatglm) |  ❌  |  ✅  |    ✅  |  ✅  |  🚧  |  ✅  | ❌  |
+| [ChatGLM2/ChatGLM3](./chatglm2) |  ❌  |    ✅  |  ✅  |  ✅  | 🚧  | ✅  | ✅  |
+| [Bloom](./bloom) | ❌  | ✅ | ✅ |  ✅ |🚧 | ✅ | ✅  |
+| [GPT-3](./gpt-3) |   ✅  |  ✅  |    🚧  | 🚧  |🚧 | 🚧 | ✅  |
+| [OPT](./opt) | 🚧 | ✅ | ✅ | 🚧 |  🚧 |🚧 | ✅  |
 
 * ✅: Supported
 * 🚧: In Progress
@@ -39,7 +39,7 @@
 ##  🚀 Quick Start 🚀
 
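 ### 1. Pretraining
-PaddleNLP integrates PaddlePaddle's 4D parallelism strategy into the Trainer API; users only need to modify the Trainer configuration to use different distributed strategies. The toolchain currently provides pretraining for models such as [LLaMA/LLaMA2](./llama), [GPT-3](./gpt-3), [Qwen](./qwen), and [Baichuan/Baichuan2](./llama); support for more models is being added continuously.
+PaddleNLP integrates PaddlePaddle's 4D parallelism strategy into the Trainer API; users only need to modify the Trainer configuration to use different distributed strategies. The toolchain currently provides pretraining for models such as [LLaMA/LLaMA2](./llama), [GPT-3](./gpt-3), [Qwen](./qwen), [Baichuan/Baichuan2](./llama), and [Mixtral](./mixtral); support for more models is being added continuously.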
 
 <div align="center">
     <img width="500" alt="llm" src="https://github.com/PaddlePaddle/PaddleNLP/assets/37530985/a2f0261d-7f76-4faf-ae01-cc9d37d5fcc0">
@@ -54,7 +54,7 @@ PaddleNLP integrates PaddlePaddle's 4D parallelism strategy into the Trainer API; users only need to modify the Tra
 We provide more detailed documentation on [pretraining data preparation](), [distributed strategy support](https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html#model-capability), and [performance test reports](https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html#model-performance); see: https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html. For the list of LLM weights, see [here](https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/index.html#model-weight).
 
 
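-This project supports pretraining of large models such as LLaMA, GPT-3, BaiChuan, and Qwen. Users can switch the config file to run with a single command.
+This project supports pretraining of large models such as LLaMA, GPT-3, BaiChuan, Qwen, and Mixtral. Users can switch the config file to run with a single command.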
 
 For the detailed data preparation workflow, see [here](https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/dataset.html): https://paddlenlp.readthedocs.io/zh/latest/llm/pretraining/dataset.html
 
@@ -79,30 +79,26 @@ mv llama_openwebtext_100k.idx ./data
 
 ```shell
 # Build the custom operators (optional)
-cd ../model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -
+cd ../legacy/model_zoo/gpt-3/external_ops/ && python3 setup.py install && cd -
 
-# LLaMA model pretraining
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./llama/pretrain-llama2_7b-tp2sd4_stage2.json
-
-# Qwen model pretraining
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./qwen/pretrain_argument_stage2.json
+# Reference command for model pretraining
+python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py ./config/llama/pretrain_argument.json
 ```
 
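 Notes:
 1. Training with the Paddle develop version is recommended; install any missing wheel packages, for example `pip install tool_helpers visualdl==2.5.3`.
 2. `use_flash_attention` should be enabled on A100 machines; a CUDA 11.8 environment is recommended.
-3. `use_fused_rms_norm` requires installing the custom OP from [this directory](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt-3/external_ops) with `python setup.py install`. If the operator still cannot be found after installation, set PYTHONPATH as well.
+3. `use_fused_rms_norm` requires installing the custom operators. If the operator still cannot be found after installation, set PYTHONPATH as well.
 4. `continue_training` means training starts from an existing pretrained model. The initial loss of the 7b model is about 2.xx, whereas the loss of a randomly initialized model decreases from around 11.x.
-5. The current script is the sharding version; users who need 4D parallel training (data, sharding, tensor, and pipeline parallelism) should refer to the `run_trainer_tp4pp2.sh` script.
-6. For multi-node training, if all machines read the training data files from the same location (for example, a mounted shared disk), specify `--share_folder true` so that global card 0 builds the cached data. Otherwise, by default, card 0 on each machine builds its cached data independently.
-7. If the dataset folder contains the default cache folder `index-cache/`, an additionally specified `--data_cache` has no effect; training loads the contents of the default cache folder first.
+5. For multi-node training, if all machines read the training data files from the same location (for example, a mounted shared disk), specify `--share_folder true` so that global card 0 builds the cached data. Otherwise, by default, card 0 on each machine builds its cached data independently.
+6. If the dataset folder contains the default cache folder `index-cache/`, an additionally specified `--data_cache` has no effect; training loads the contents of the default cache folder first.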
 
 
 
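 ### 2. Fine-tuning
 PaddleNLP supports fine-tuning strategies such as SFT, LoRA, and Prefix Tuning for multiple mainstream large models, providing a unified and efficient fine-tuning solution:
 - **Unified training entry point**. The fine-tuning solution of the PaddlePaddle LLM suite adapts to mainstream large models in the industry; by modifying only the configuration file, users can fine-tune a variety of large models on a single card or on multiple cards (with support for the 4D parallel distributed strategy).
-- **Efficient data and distributed strategies**. The Zero Padding optimization strategy effectively reduces the proportion of pad tokens, improving model training efficiency by up to 100%. The original combination of PEFT with low-bit and distributed parallel strategies greatly lowers the hardware barrier for fine-tuning large models, supporting fine-tuning of ten-billion-parameter models on a single card (A100 80G) and of hundred-billion-parameter models on a single machine (A100 80G * 8).
+- **Efficient data and distributed strategies**. The Zero Padding optimization strategy, combined with the FlashMask strategy, effectively improves model training efficiency. The original combination of PEFT with low-bit and distributed parallel strategies greatly lowers the hardware barrier for fine-tuning large models, supporting fine-tuning of ten-billion-parameter models on a single card (A100 80G) and of hundred-billion-parameter models on a single machine (A100 80G * 8).
 - **Multi-turn dialogue support**. Supports a unified chat template and efficient multi-turn dialogue training; see the [multi-turn dialogue documentation](./docs/chat_template.md) for details.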
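
To illustrate why reducing pad tokens helps, here is a minimal sketch that assumes Zero Padding amounts to packing several short samples into one fixed-length row. The greedy first-fit packing and all function names below are illustrative assumptions, not PaddleNLP's implementation; a real implementation must also mask attention so that packed samples do not attend to each other.

```python
def pad_ratio_unpacked(lengths, max_length):
    """Fraction of pad tokens when every sample is padded to max_length."""
    return 1.0 - sum(lengths) / (len(lengths) * max_length)


def pack_greedy(lengths, max_length):
    """Place each sample into the first row that still has room (first-fit)."""
    rows = []  # each row is a list of sample lengths sharing one sequence
    for n in lengths:
        if n > max_length:
            raise ValueError(f"sample of length {n} exceeds max_length {max_length}")
        for row in rows:
            if sum(row) + n <= max_length:
                row.append(n)
                break
        else:
            rows.append([n])  # no existing row fits, so start a new one
    return rows


def pad_ratio_packed(rows, max_length):
    """Fraction of pad tokens after packing."""
    used = sum(sum(row) for row in rows)
    return 1.0 - used / (len(rows) * max_length)


lengths = [300, 700, 200, 512, 100, 900]  # made-up token counts
rows = pack_greedy(lengths, max_length=1024)
print(rows)
print(f"pad ratio without packing: {pad_ratio_unpacked(lengths, 1024):.3f}")
print(f"pad ratio with packing:    {pad_ratio_packed(rows, 1024):.3f}")
```

In this toy example the six samples fit into three rows instead of six, so far fewer positions are spent on padding.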
 
 
@@ -137,26 +133,26 @@ tar -zxvf AdvertiseGen.tar.gz
 
 **Full-parameter fine-tuning: SFT**
 ```bash
-# Reference launch command for 4-GPU LLaMA SFT
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_argument.json
+# Reference launch command for SFT
+python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py ./config/llama/sft_argument.json
 ```
 
 **LoRA**
 ```bash
-# Reference launch command for single-GPU LLaMA LoRA
-python  finetune_generation.py ./llama/lora_argument.json
+# Reference launch command for LoRA
+python  run_finetune.py ./config/llama/lora_argument.json
 ```
 
 **Prefix Tuning**
 ```bash
-# Reference launch command for single-GPU LLaMA Prefix Tuning
-python  finetune_generation.py ./llama/pt_argument.json
+# Reference launch command for Prefix Tuning
+python  run_finetune.py ./config/llama/pt_argument.json
 ```
 
 For more documentation on distributed fine-tuning of large models, training details, and results, see the [LLM fine-tuning tutorial](./docs/finetune.md).
 
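 ### 3. Alignment
-We support preference alignment strategies such as DPO.
+We support preference alignment strategies such as DPO. The DPO strategy uses the zero_padding strategy combined with the FlashMask strategy to effectively improve model training efficiency.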
 
 **Data preparation**:
 
@@ -189,10 +185,10 @@ wget https://bj.bcebos.com/paddlenlp/datasets/examples/ultrafeedback_binarized.t
 tar -zxvf ultrafeedback_binarized.tar.gz
 ```
 
-**Full-parameter fine-tuning: SFT**
+**Full-parameter DPO**
 ```bash
-# Reference launch command for 4-GPU LLaMA SFT
-python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" dpo_train.py ./llama/dpo_argument.json
+# Reference launch command for DPO
+python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./dpo/run_dpo.py ./config/llama/dpo_argument.json
 ```
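
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss, written in plain Python. It is not taken from `run_dpo.py`; the function name, argument names, and the default `beta` are assumptions chosen for illustration. Each argument is the summed log-probability of a response under the policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy favors the chosen response more than the reference does,
# so the loss falls below log(2), the value at zero margin.
print(dpo_loss(-10.0, -20.0, -15.0, -15.0, beta=0.1))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.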
 
 ### 4. Quantization
@@ -215,10 +211,10 @@ python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" dpo_train.py ./
 
 ```
 # Reference launch command for PTQ quantization
-python  finetune_generation.py ./llama/ptq_argument.json
+python  run_finetune.py ./config/llama/ptq_argument.json
 
 # Reference launch command for GPTQ quantization
-python  finetune_generation.py ./llama/ptq_argument.json
+python  run_finetune.py ./config/llama/ptq_argument.json
 ```
 
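 For more technical details and usage of model quantization, see the [quantization documentation](./docs/quantization.md).
@@ -231,13 +227,13 @@ In addition to common model inference, PaddleNLP also provides high-performance inference, with built-in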
 
 ```shell
 # Reference command for dynamic-graph model inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file ./data/dev.json --dtype float16
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file ./data/dev.json --dtype float16
 
 # Reference commands for static-graph model inference
 # step1: export the static graph
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --output_path ./inference --dtype float16
 # step2: static-graph inference
-python predictor.py --model_name_or_path ./inference --data_file ./data/dev.json --dtype float16 --mode static
+python ./predict/predictor.py --model_name_or_path ./inference --data_file ./data/dev.json --dtype float16 --mode static
 ```
 
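 - **InferenceModel high-performance inference**: PaddleNLP also provides high-performance inference models that speed up parallel inference, supporting multiple inference modes including FP16, Prefix Tuning, WINT8, and A8W8.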
@@ -253,13 +249,13 @@ python predictor.py --model_name_or_path ./inference --data_file ./data/dev.json
 
 ```shell
 # Reference command for high-performance dynamic-graph model inference
-python predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16
+python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16
 
 # Reference commands for high-performance static-graph model inference
 # step1: export the static graph
-python export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
+python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
 # step2: static-graph inference
-python predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static"
+python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static"
 ```
 
 For more on common model inference and high-performance model usage, see the [LLM inference documentation](./docs/inference.md).
@@ -277,7 +273,7 @@ python predictor.py --model_name_or_path ./inference --inference_model --dtype "
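 We provide a simple, easy-to-use UI service deployment script based on dynamic-graph inference, so users can quickly deploy inference as a service.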
 
 ```
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" flask_server.py \
+python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./predict/flask_server.py \
     --model_name_or_path meta-llama/Llama-2-7b-chat \
     --port 8010 \
     --flask_port 8011 \
@@ -287,7 +283,7 @@ python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" flask_server.py \
 - `flask_port`: Flask service port, default 8010.
 - For other parameters, see the inference parameter configuration in the [inference documentation](./docs/inference.md).
 
-In addition, to run inference through an API script, see the `./request_flask_server.py` file.
+In addition, to run inference through an API script, see the `./predict/request_flask_server.py` file.
 
 </div></details>