[Feature] support reward model #5216
base: develop
Conversation
Thanks for your contribution!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             develop    #5216   +/-  ##
==========================================
  Coverage           ?    60.99%
==========================================
  Files              ?       317
  Lines              ?     38831
  Branches           ?      5856
==========================================
  Hits               ?     23685
  Misses             ?     13278
  Partials           ?      1868
```

Flags with carried forward coverage won't be shown.
Please add usage documentation for deploying pooling models and complete the parameter descriptions (including launch parameters and request parameters).
Pull request overview
This PR adds support for reward models, extending FastDeploy to run reward-model scoring. The feature lets a model evaluate the quality of user-assistant conversation pairs.

Key changes:

- Add "reward" as a new task type and convert option in the configuration system
- Implement pooling logic for reward models, supporting the LAST pooling type (see the sketch after this list)
- Add a new /v1/reward API endpoint and corresponding tests
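As a rough illustration only (this is not the PR's code; every name here is made up for the sketch), LAST pooling for a reward model takes the hidden state of each sequence's final token and feeds it through a scalar reward head:

```python
import paddle

def last_pool_reward(hidden_states, seq_lens, reward_head):
    """Sketch: LAST pooling followed by a reward head.

    hidden_states: [total_tokens, hidden_size], sequences packed back to back.
    seq_lens: Python list of per-sequence token counts.
    reward_head: e.g. paddle.nn.Linear(hidden_size, 1).
    """
    # Index of the last token of each sequence in the packed layout.
    offsets = paddle.cumsum(paddle.to_tensor(seq_lens, dtype="int64")) - 1
    last_hidden = paddle.gather(hidden_states, offsets)  # [n_seq, hidden_size]
    return reward_head(last_hidden)  # [n_seq, 1] reward scores
```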
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/pooling/test_Qwen3-Embedding_serving.py | Change the behavior when the model path is not found from skip to raising an error |
| tests/pooling/test_Ernie4_5_reward_serving.py | Add a full test suite for reward model serving, including baseline comparison |
| fastdeploy/worker/gpu_model_runner.py | Modify pooling logic to support reward models; add max_tokens handling for pooling models |
| fastdeploy/model_executor/pre_and_post_process.py | Add null checks to handle partial pooler outputs |
| fastdeploy/model_executor/models/model_base.py | Add the REWARD category to pooling model support |
| fastdeploy/model_executor/models/ernie_vl_rm.py | Implement the reward model, add multi-task pooling support, switch to LAST pooling |
| fastdeploy/model_executor/models/ernie4_5_vl/ernie4_5_vl_moe.py | Add float32 normalization and weight mapping for the reward model |
| fastdeploy/model_executor/models/adapters.py | Add a reward pooler in the adapter |
| fastdeploy/model_executor/model_loader/default_loader_v1.py | Map the reward convert type to the embedding model |
| fastdeploy/model_executor/layers/pooler.py | Add reward pooling support; update the pooler factory method and StepPooler |
| fastdeploy/model_executor/layers/pool/metadata.py | Change the device parameter type from string to a Place object |
| fastdeploy/entrypoints/openai/serving_engine.py | Remove add_generation_prompt from chat template kwargs |
| fastdeploy/entrypoints/openai/protocol.py | Remove the add_generation_prompt field from EmbeddingChatRequest |
| fastdeploy/engine/request.py | Automatically disable thinking mode for pooling requests |
| fastdeploy/engine/pooling_params.py | Default normalize to False for reward tasks (sketched after this table) |
| fastdeploy/engine/args_utils.py | Disable prefix caching for the pooling runner |
| fastdeploy/config.py | Add the reward task type definition and configuration mapping |
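To make the pooling_params.py row concrete, here is a hedged sketch of "default normalize to False for reward tasks"; the class shape and field names are assumptions, not the repo's actual definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PoolingParams:
    """Illustrative stand-in for the engine's pooling parameters."""
    task: str = "embed"
    normalize: Optional[bool] = None  # None means "use the task default"

    def resolved_normalize(self) -> bool:
        # Embedding vectors are typically L2-normalized; reward scores are
        # raw scalars, so reward tasks default to normalize=False.
        if self.normalize is None:
            return self.task != "reward"
        return self.normalize
```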
```python
for seq_len, prompt_len in zip(seq_lens_cpu, pooling_metadata.prompt_lens):
    output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None
```

Copilot AI — Nov 27, 2025

This code has a critical bug. The original code iterated over raw_pooler_output together with the other variables in the zip, extracting .data from each raw_output element. The modified code accesses raw_pooler_output[0].data on every iteration, so all pooler outputs incorrectly reference the same first element.

Additionally, if raw_pooler_output is a list of tensors (which is what the head returns), then raw_pooler_output[0] is a tensor, and .data is not a valid attribute on PaddlePaddle tensors.

Suggested change, restoring the original iteration logic:

```python
for raw_output, seq_len, prompt_len in zip(raw_pooler_output, seq_lens_cpu, pooling_metadata.prompt_lens):
    output = raw_output if int(seq_len) == int(prompt_len) else None
    pooler_output.append(output)
```
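A tiny self-contained illustration of why the suggested loop matters: each sequence gets its own raw output, and (by this reading of the "partial pooler outputs" handling mentioned above) sequences whose seq_len does not yet equal prompt_len yield None. The stand-in values below replace real tensors:

```python
# Stand-ins for tensors; the partial-prefill interpretation is an assumption.
raw_pooler_output = ["score_a", "score_b", "score_c"]
seq_lens_cpu = [4, 2, 7]
prompt_lens = [4, 5, 7]

pooler_output = []
for raw_output, seq_len, prompt_len in zip(raw_pooler_output, seq_lens_cpu, prompt_lens):
    pooler_output.append(raw_output if int(seq_len) == int(prompt_len) else None)

print(pooler_output)  # ['score_a', None, 'score_c']
```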
```diff
-num_scheduled_tokens = paddle.to_tensor(num_scheduled_tokens)
-cumsum = paddle.zeros([n_seq + 1], dtype="int64")
+num_scheduled_tokens = paddle.to_tensor(num_scheduled_tokens, dtype="int64", place=paddle.CPUPlace())
+cumsum = paddle.zeros([n_seq + 1], dtype="int64", device=paddle.CPUPlace())
```

Copilot AI — Nov 27, 2025

paddle.zeros does not accept a device parameter. The correct parameter name is place. This will cause a runtime error.

Suggested change:

```python
cumsum = paddle.zeros([n_seq + 1], dtype="int64", place=paddle.CPUPlace())
```

done
```diff
-num_scheduled_tokens = paddle.to_tensor(num_scheduled_tokens)
-cumsum = paddle.zeros([n_seq + 1], dtype="int64")
+num_scheduled_tokens = paddle.to_tensor(num_scheduled_tokens, dtype="int64", place=paddle.CPUPlace())
+cumsum = paddle.zeros([n_seq + 1], dtype="int64", device=paddle.CPUPlace())
```

Copilot AI — Nov 27, 2025

paddle.to_tensor does not accept a device parameter. The correct parameter name is place.

Suggested change:

```python
num_scheduled_tokens = paddle.to_tensor(num_scheduled_tokens, dtype="int64", place=paddle.CPUPlace())
```

done
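A small standalone sketch of explicit CPU placement in Paddle (Paddle 2.x assumed): paddle.to_tensor accepts place directly; if a given release's paddle.zeros rejects placement keywords altogether, creating on the default device and moving with Tensor.cpu() is a portable fallback:

```python
import paddle

# to_tensor takes `place` directly (Paddle 2.x).
num_tokens = paddle.to_tensor([3, 5, 2], dtype="int64", place=paddle.CPUPlace())

# Fallback that sidesteps the place/device keyword question entirely:
# create on the default device, then move to CPU.
cumsum = paddle.zeros([4], dtype="int64").cpu()

print(num_tokens.place, cumsum.place)
```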
```python
@pytest.fixture
def consistent_payload():
    """
    Returns a fixed payload for reward model consistency testing.
    Reward models evaluate user-assistant conversation pairs.
    """
    return {
        "model": "default",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "北京天安门在哪里?"}]},
            {
                "role": "assistant",
                "content": [{"type": "text", "text": "北京天安门在中国北京故宫的前面。"}],
            },
        ],
        "user": "test-user-123",
    }
```

Copilot AI — Nov 27, 2025

The fixture consistent_payload is defined but never used in any test. If it's intended for future use, consider adding a comment explaining its purpose. Otherwise, it should be removed to reduce code clutter.

If the fixture is not needed, remove it to simplify the code.
Suggested change: delete the fixture (no replacement needed).
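If the fixture were kept, a consumer could look like the hedged sketch below; reward_api_url and the response field names are assumptions, not part of this PR's test code:

```python
import requests

def test_reward_consistency(consistent_payload, reward_api_url):
    # Hypothetical test: post the fixed payload to the /v1/reward endpoint
    # and check that a well-formed response comes back.
    resp = requests.post(reward_api_url, json=consistent_payload, timeout=60)
    assert resp.status_code == 200
    body = resp.json()
    assert "data" in body  # response schema is an assumption
```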
```python
print(f"\n=== Sending request to {reward_api_url} ===")

# 发送HTTP请求
```
Copilot AI — Nov 27, 2025

This comment is in Chinese, but other comments in the codebase are in English. Comments should be standardized to English to keep the code style consistent.

Suggested change:

```python
# Send HTTP request
```

done
```python
for pid in output.splitlines():
    os.kill(int(pid), signal.SIGKILL)
    print(f"Killed process on port {port}, pid={pid}")
except subprocess.CalledProcessError:
```

Copilot AI — Nov 27, 2025

The 'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

```python
except subprocess.CalledProcessError:
    # No process is listening on the port, so nothing to kill.
```

done
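For context, the snippet reads like a test helper that frees a port before starting the server. A self-contained sketch is below; the function name and the lsof lookup are assumptions, while the kill loop and the except handling mirror the reviewed code:

```python
import os
import signal
import subprocess

def kill_process_on_port(port: int) -> None:
    """Kill any process listening on `port` (sketch of the test helper)."""
    try:
        # `lsof -ti :PORT` prints one PID per line; the exact lookup command
        # used by the real test is an assumption here.
        output = subprocess.check_output(["lsof", "-ti", f":{port}"], text=True)
        for pid in output.splitlines():
            os.kill(int(pid), signal.SIGKILL)
            print(f"Killed process on port {port}, pid={pid}")
    except subprocess.CalledProcessError:
        # No process is listening on the port, so nothing to kill.
        pass
```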
Motivation

Support reward models.

Modifications

Add support for reward models.

Usage or Command

Server launch:

Request: (a hedged example follows)
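The description leaves the concrete commands blank. As a hedged illustration only, a request to the new /v1/reward endpoint might look like the sketch below; the host and port are assumptions, and the payload shape follows the test fixture shown earlier:

```python
import requests

# Assumed local deployment; adjust host/port to your server.
url = "http://localhost:8188/v1/reward"
payload = {
    "model": "default",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "北京天安门在哪里?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "北京天安门在中国北京故宫的前面。"}]},
    ],
}
resp = requests.post(url, json=payload, timeout=60)
print(resp.json())
```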
Accuracy Tests
Checklist
- Use at least one tag from the list in the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.