Commit 592fb78

fix doc
Signed-off-by: wangli <[email protected]>
1 parent 8ff9b88 commit 592fb78

1 file changed, +48 -13 lines changed

docs/source/user_guide/sleep_mode.md

Lines changed: 48 additions & 13 deletions
@@ -9,17 +9,30 @@ Since the generation and training phases may employ different model parallelism
 
 ## Getting started
 
-With `enable_sleep_mode=True`, the way we manage memory(malloc, free) in vllm will under the management of a specific memory pool, during loading model weight and initialize kv_caches, we tag the memory as a map: `{"weight": data, "kv_cache": data}`
+With `enable_sleep_mode=True`, memory allocation and release (malloc, free) in vLLM are handled by a dedicated memory pool. While loading model weights and initializing the KV caches, the memory is tagged as a map: `{"weight": data, "kv_cache": data}`.
 
+The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
 
-Since this feature uses the AscendCL API, in order to use sleep mode, you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and building from source, if you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`, for the latest version(v0.9.x+), the environment variable COMPILE_CUSTOM_KERNELS will be set 1 by default while building from source.
+- Level 1 Sleep
+   - Action: Offloads model weights and discards the KV cache.
+   - Memory: Model weights are moved to CPU memory; the KV cache is forgotten.
+   - Use Case: Suitable when reusing the same model later.
+   - Note: Ensure sufficient CPU memory is available to hold the model weights.
+
+- Level 2 Sleep
+   - Action: Discards both model weights and KV cache.
+   - Memory: The content of both the model weights and the KV cache is forgotten.
+   - Use Case: Ideal when switching to a different model or updating the current one.
+
+Because this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, sleep mode requires following the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and building from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for later versions (v0.9.x+), `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.
 
 ## Usage
 
-Let's take the default parameters of v1 engine as an example
+The following is a simple example of how to use sleep mode.
 
-- For offline inference:
-```python
+- offline inference:
+
+````python
 import os
 
 import torch
@@ -55,12 +68,13 @@ if __name__ == "__main__":
     output2 = llm.generate(prompt, sampling_params)
     # cmp output
     assert output[0].outputs[0].text == output2[0].outputs[0].text
-```
+````
 
-- For online serving:
+- online serving:
+:::{note}
 Because these endpoints could be abused by malicious callers, make sure you are in dev mode and explicitly set the environment variable `VLLM_SERVER_DEV_MODE` to expose them (sleep/wake up).
+:::
 
-server command:
 ```bash
 export VLLM_SERVER_DEV_MODE="1"
 export VLLM_USE_V1="1"
@@ -71,13 +85,34 @@ vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode
 
 # after serving is up, call these endpoints
 
-# 1. sleep
-curl -X POST http://127.0.0.1:8000/sleep
+# sleep level 1
+curl -X POST http://127.0.0.1:8000/sleep \
+     -H "Content-Type: application/json" \
+     -d '{"level": "1"}'
+
+curl -X GET http://127.0.0.1:8000/is_sleeping
 
-curl -X POST http://127.0.0.1:8000/is_sleeping
+# sleep level 2
+curl -X POST http://127.0.0.1:8000/sleep \
+     -H "Content-Type: application/json" \
+     -d '{"level": "2"}'
 
-# 2. wake up
+# wake up
 curl -X POST http://127.0.0.1:8000/wake_up
-curl -X POST http://127.0.0.1:8000/is_sleeping
+
+# wake up with tag, tags must be in ["weights", "kv_cache"]
+curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"
+
+curl -X GET http://127.0.0.1:8000/is_sleeping
+
+# after sleep and wake up, the server is still available
+curl http://localhost:8000/v1/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "Qwen/Qwen2.5-0.5B-Instruct",
+         "prompt": "The future of AI is",
+         "max_tokens": 7,
+         "temperature": 0
+     }'
 
 ```
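
The endpoint sequence in the bash example above can also be driven from Python. The following is a minimal sketch and is not part of the commit. It only builds the same requests the curl commands send, using the standard library. The helper names (`sleep_request`, `wake_up_request`, `is_sleeping_request`, `send`) are hypothetical, and two points are assumptions rather than confirmed behavior: that the server reads `level` from the JSON body as the curl example implies (some server versions may read it from a query string instead), and that multiple tags are passed as repeated `tags` query parameters.

```python
import json
import urllib.parse
import urllib.request

# Address taken from the curl examples in the diff.
BASE_URL = "http://127.0.0.1:8000"

# Tag names listed in the diff's comment for /wake_up.
ALLOWED_TAGS = {"weights", "kv_cache"}


def sleep_request(level: int) -> urllib.request.Request:
    """Build POST /sleep with a JSON body, mirroring the curl example.

    Assumption: the server reads "level" from the body; verify this
    against the server version in use.
    """
    if level not in (1, 2):
        raise ValueError("sleep level must be 1 or 2")
    body = json.dumps({"level": str(level)}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/sleep",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wake_up_request(tags=None) -> urllib.request.Request:
    """Build POST /wake_up, optionally restricted to the given tags.

    Assumption: several tags are sent as repeated "tags" parameters;
    the diff only shows the single-tag form.
    """
    tags = list(tags or [])
    unknown = set(tags) - ALLOWED_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    query = urllib.parse.urlencode([("tags", tag) for tag in tags])
    url = f"{BASE_URL}/wake_up" + (f"?{query}" if query else "")
    return urllib.request.Request(url, method="POST")


def is_sleeping_request() -> urllib.request.Request:
    """Build GET /is_sleeping."""
    return urllib.request.Request(f"{BASE_URL}/is_sleeping", method="GET")


def send(request: urllib.request.Request) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Against a running server started with `VLLM_SERVER_DEV_MODE="1"` and `--enable-sleep-mode`, a call such as `send(sleep_request(1))` followed by `send(is_sleeping_request())` would correspond to the first two curl commands. Keeping request construction separate from `send` lets the URLs and payloads be inspected without a server.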
