
feat(serve): add --generation-config CLI for server sampling defaults #4708

Open
lvhan028 wants to merge 1 commit into InternLM:main from lvhan028:feat/generation-config-cli
@lvhan028

Collaborator

No description provided.

Align api_server with vLLM by loading HuggingFace generation_config.json
as default sampling params, with optional override and lmdeploy fallback.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings June 25, 2026 08:58

Copilot AI left a comment

Contributor

Pull request overview

Adds a server-side “generation config” mechanism to centralize sampling defaults (and an optional server-wide max_new_tokens cap) for the serving stack, wiring it through OpenAI/Responses/Anthropic request handling and exposing it via new CLI flags.

Changes:

  • Introduces lmdeploy.serve.core.generation_config helpers to load HF generation_config.json, merge request/server defaults, and build GenerationConfig.
  • Updates OpenAI/Responses/Anthropic serving code to use merged sampling defaults and adjusts protocol model defaults to None so request fields are only applied when explicitly provided.
  • Adds CLI flags --generation-config and --override-generation-config plus unit tests for the merge/resolution logic.
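The merge behaviour described above can be illustrated with a minimal sketch. It assumes a precedence of request value, then server default, then built-in fallback, with `None` meaning "not provided". The function name `merge_sampling` and the values in `FALLBACK_DEFAULTS` are hypothetical and are not taken from the PR.

```python
from typing import Any, Dict, Optional

# Hypothetical fallback values, chosen only for this illustration.
FALLBACK_DEFAULTS: Dict[str, Any] = {'temperature': 1.0, 'top_p': 1.0, 'top_k': 40}


def merge_sampling(request: Dict[str, Optional[Any]],
                   server_defaults: Dict[str, Any],
                   fallback: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Resolve each field as: request value, else server default, else fallback."""
    fallback = FALLBACK_DEFAULTS if fallback is None else fallback
    merged: Dict[str, Any] = {}
    for key in fallback.keys() | server_defaults.keys() | request.keys():
        # The first source that supplies a non-None value wins.
        for source in (request, server_defaults, fallback):
            value = source.get(key)
            if value is not None:
                merged[key] = value
                break
    return merged


# A request that leaves temperature unset picks up the server default.
merged = merge_sampling({'temperature': None, 'top_p': 0.9}, {'temperature': 0.6})
```

This is also why the protocol models need `None` defaults: if a request field carried a concrete default such as `1.0`, the server could not tell an explicit value from an omitted one.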

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_lmdeploy/serve/test_generation_config.py | Adds unit tests for sampling merge, max token resolution, and server-default resolution. |
| lmdeploy/serve/openai/serving_completion.py | Validates sampling values after merging request/server/fallback defaults. |
| lmdeploy/serve/openai/serving_chat_completion.py | Same as above for chat-completions validation. |
| lmdeploy/serve/openai/responses/serving.py | Passes server defaults/cap into Responses to_generation_config. |
| lmdeploy/serve/openai/responses/request.py | Rebuilds Responses GenerationConfig via shared generation-config helpers. |
| lmdeploy/serve/openai/responses/protocol.py | Sets certain sampling fields to default None to enable server defaults. |
| lmdeploy/serve/openai/protocol.py | Sets multiple sampling-related request defaults to None to enable server defaults. |
| lmdeploy/serve/openai/api_server.py | Centralizes GenerationConfig construction and wires server sampling defaults/cap into request handling. |
| lmdeploy/serve/core/generation_config.py | New core module implementing config loading, merging, and GenerationConfig building. |
| lmdeploy/serve/anthropic/protocol.py | Sets temperature default to None to enable server defaults. |
| lmdeploy/serve/anthropic/endpoints/messages.py | Passes server defaults/cap into Anthropic to_generation_config. |
| lmdeploy/serve/anthropic/adapter.py | Rebuilds Anthropic GenerationConfig via shared generation-config helpers. |
| lmdeploy/cli/utils.py | Adds CLI args for generation config source and overrides. |
| lmdeploy/cli/serve.py | Wires new CLI args through to server launch entrypoints. |


Comment on lines +69 to +70
override = override or {}
src = generation_config
Comment on lines +82 to +85
override_max_new_tokens = sampling.pop('max_new_tokens', None)
if override_max_new_tokens is not None:
    override_max_new_tokens = int(override_max_new_tokens)
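For context, the two fragments above appear to belong to the step that resolves server-wide defaults. The following is a rough sketch of how such a step might look, not the PR's actual code: the function name `resolve_server_defaults`, its signature, and the choice to layer the override on top of the loaded config before popping `max_new_tokens` are all assumptions.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_server_defaults(
    generation_config: Dict[str, Any],
    override: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[str, Any], Optional[int]]:
    """Return (sampling defaults, server-wide max_new_tokens cap)."""
    override = override or {}
    # Assumption: override keys take precedence over the loaded config.
    sampling = {**generation_config, **override}
    # max_new_tokens is treated as a cap, not as a sampling parameter.
    cap = sampling.pop('max_new_tokens', None)
    if cap is not None:
        cap = int(cap)  # tolerate string values such as '2048'
    # Drop unset entries so later merging can fall through to other sources.
    sampling = {k: v for k, v in sampling.items() if v is not None}
    return sampling, cap


defaults, cap = resolve_server_defaults(
    {'temperature': 0.7, 'top_p': 0.8},
    {'top_p': 0.95, 'max_new_tokens': '2048'},
)
```

Note that `int(...)` raises `ValueError` on non-numeric input, so a real implementation would likely want to validate the override before this point.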

Comment on lines +138 to +142
request_value = max_completion_tokens if max_completion_tokens is not None else max_tokens
if request_value is None:
    return server_cap
if server_cap is not None:
    return min(request_value, server_cap)
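The excerpt above stops before the final branch. A self-contained sketch of the capping logic might look like the following; the function name `resolve_max_new_tokens`, its signature, and the last `return` (used when no server cap is configured) are assumptions rather than the PR's code.

```python
from typing import Optional


def resolve_max_new_tokens(max_tokens: Optional[int],
                           max_completion_tokens: Optional[int],
                           server_cap: Optional[int]) -> Optional[int]:
    """Pick the request's token limit and clamp it to the server cap, if any."""
    # max_completion_tokens takes priority over max_tokens, as in the excerpt.
    request_value = max_completion_tokens if max_completion_tokens is not None else max_tokens
    if request_value is None:
        return server_cap
    if server_cap is not None:
        return min(request_value, server_cap)
    # Assumed final branch: without a cap, the request value is used as-is.
    return request_value


limit = resolve_max_new_tokens(max_tokens=4096, max_completion_tokens=None, server_cap=1024)
```

With this shape, a request that omits both fields inherits the cap as its limit, and the result is `None` only when neither the request nor the server specifies anything.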