feat(serve): add --generation-config CLI for server sampling defaults #4708
Open
lvhan028 wants to merge 1 commit into
Align api_server with vLLM by loading HuggingFace generation_config.json as default sampling params, with optional override and lmdeploy fallback.

Co-authored-by: Cursor <cursoragent@cursor.com>
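The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `load_server_sampling_defaults`, the set of recognized keys, and the fallback values are all assumptions; only the file name `generation_config.json` and the file/override/fallback layering come from the description.

```python
import json
import os
import tempfile

# Hypothetical subset of sampling keys; the PR may recognize a different set.
SAMPLING_KEYS = ('temperature', 'top_p', 'top_k', 'repetition_penalty')

# Hypothetical fallback defaults, standing in for lmdeploy's own values.
FALLBACK_DEFAULTS = {'temperature': 1.0, 'top_p': 1.0}


def load_server_sampling_defaults(model_dir, override=None):
    """Read generation_config.json if present, then apply overrides."""
    defaults = dict(FALLBACK_DEFAULTS)
    path = os.path.join(model_dir, 'generation_config.json')
    if os.path.isfile(path):
        with open(path, encoding='utf-8') as f:
            hf_config = json.load(f)
        # Keep only known sampling keys that are actually set in the file.
        defaults.update({k: hf_config[k] for k in SAMPLING_KEYS
                         if hf_config.get(k) is not None})
    # Explicit overrides win over both the file and the fallback.
    defaults.update(override or {})
    return defaults


# Usage: write a throwaway config and resolve the defaults from it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, 'generation_config.json')
    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump({'temperature': 0.6, 'top_k': 20, 'eos_token_id': 2}, f)
    resolved = load_server_sampling_defaults(tmp, override={'top_p': 0.9})
    # A directory without the file falls back to the built-in defaults.
    missing = load_server_sampling_defaults(os.path.join(tmp, 'nope'))
```

In this sketch non-sampling keys such as `eos_token_id` are ignored, and a missing file is not an error; whether the PR behaves the same way is not stated in the description.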
Pull request overview
Adds a server-side “generation config” mechanism to centralize sampling defaults (and an optional server-wide max_new_tokens cap) for the serving stack, wiring it through OpenAI/Responses/Anthropic request handling and exposing it via new CLI flags.
Changes:
- Introduces `lmdeploy.serve.core.generation_config` helpers to load HF `generation_config.json`, merge request/server defaults, and build `GenerationConfig`.
- Updates OpenAI/Responses/Anthropic serving code to use merged sampling defaults and adjusts protocol model defaults to `None` so request fields are only applied when explicitly provided.
- Adds CLI flags `--generation-config` and `--override-generation-config` plus unit tests for the merge/resolution logic.
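The reason for switching protocol defaults to `None` can be illustrated with a small merge sketch. The function name `merge_sampling` and the example values are hypothetical; the precedence shown (explicit request value, then server default, then fallback) is the one implied by the overview, not a copy of the PR's helper.

```python
from typing import Any, Dict, Optional


def merge_sampling(request: Dict[str, Optional[Any]],
                   server_defaults: Dict[str, Any],
                   fallback: Dict[str, Any]) -> Dict[str, Any]:
    """Merge sampling params: request > server defaults > fallback."""
    merged = dict(fallback)
    merged.update(server_defaults)
    # None means "the client did not send this field", so it must not
    # overwrite a server default.
    merged.update({k: v for k, v in request.items() if v is not None})
    return merged


merged = merge_sampling(
    request={'temperature': None, 'top_p': 0.8},
    server_defaults={'temperature': 0.6},
    fallback={'temperature': 1.0, 'top_p': 1.0, 'top_k': 40},
)
print(merged)
```

If the request model defaulted `temperature` to a concrete number instead of `None`, the merge could not tell an explicit client value from an omitted one, and the server default would never apply.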
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/serve/test_generation_config.py | Adds unit tests for sampling merge, max token resolution, and server-default resolution. |
| lmdeploy/serve/openai/serving_completion.py | Validates sampling values after merging request/server/fallback defaults. |
| lmdeploy/serve/openai/serving_chat_completion.py | Same as above for chat-completions validation. |
| lmdeploy/serve/openai/responses/serving.py | Passes server defaults/cap into Responses to_generation_config. |
| lmdeploy/serve/openai/responses/request.py | Rebuilds Responses GenerationConfig via shared generation-config helpers. |
| lmdeploy/serve/openai/responses/protocol.py | Sets certain sampling fields to default None to enable server defaults. |
| lmdeploy/serve/openai/protocol.py | Sets multiple sampling-related request defaults to None to enable server defaults. |
| lmdeploy/serve/openai/api_server.py | Centralizes GenerationConfig construction and wires server sampling defaults/cap into request handling. |
| lmdeploy/serve/core/generation_config.py | New core module implementing config loading, merging, and GenerationConfig building. |
| lmdeploy/serve/anthropic/protocol.py | Sets temperature default to None to enable server defaults. |
| lmdeploy/serve/anthropic/endpoints/messages.py | Passes server defaults/cap into Anthropic to_generation_config. |
| lmdeploy/serve/anthropic/adapter.py | Rebuilds Anthropic GenerationConfig via shared generation-config helpers. |
| lmdeploy/cli/utils.py | Adds CLI args for generation config source and overrides. |
| lmdeploy/cli/serve.py | Wires new CLI args through to server launch entrypoints. |
Comment on lines +69 to +70

    override = override or {}
    src = generation_config
Comment on lines +82 to +85

    override_max_new_tokens = sampling.pop('max_new_tokens', None)
    if override_max_new_tokens is not None:
        override_max_new_tokens = int(override_max_new_tokens)
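For context, the quoted lines appear to separate a server-wide token cap from the remaining sampling overrides. A self-contained sketch of that split, with the hypothetical helper name `split_override` and an assumed dict-shaped override, might look like this:

```python
def split_override(override):
    """Split an override mapping into (sampling params, max_new_tokens cap)."""
    # Copy so the caller's dict is not mutated by pop().
    sampling = dict(override or {})
    cap = sampling.pop('max_new_tokens', None)
    if cap is not None:
        # Values parsed from a CLI string may arrive as str; coerce to int.
        cap = int(cap)
    return sampling, cap


sampling, cap = split_override({'temperature': 0.7, 'max_new_tokens': '512'})
empty_sampling, no_cap = split_override(None)
```

Note that `int()` raises `ValueError` on input such as `'abc'`; the sketch leaves that error unhandled.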
Comment on lines +138 to +142

    request_value = max_completion_tokens if max_completion_tokens is not None else max_tokens
    if request_value is None:
        return server_cap
    if server_cap is not None:
        return min(request_value, server_cap)
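The quoted lines can be wrapped into a runnable function to make the resolution rule concrete. The function name `resolve_max_new_tokens`, its signature, and the final `return request_value` are assumptions, since the snippet is cut off before the end of the function.

```python
from typing import Optional


def resolve_max_new_tokens(max_tokens: Optional[int],
                           max_completion_tokens: Optional[int],
                           server_cap: Optional[int]) -> Optional[int]:
    """Pick the effective token limit for one request."""
    # max_completion_tokens takes priority over max_tokens.
    request_value = max_completion_tokens if max_completion_tokens is not None else max_tokens
    if request_value is None:
        # The client sent no limit: the server cap (possibly None) applies.
        return server_cap
    if server_cap is not None:
        # The client limit is clamped to the server-wide cap.
        return min(request_value, server_cap)
    # Assumed final branch: no cap configured, use the client value as-is.
    return request_value


print(resolve_max_new_tokens(None, None, 1024))   # no request limit
print(resolve_max_new_tokens(4096, None, 1024))   # clamped to the cap
print(resolve_max_new_tokens(4096, 256, None))    # max_completion_tokens wins
```

In this reading the cap doubles as the default when the client sends no limit, and a request can lower the limit but never raise it above the cap.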