
Commit 6269537

fix doc
Signed-off-by: wangli <[email protected]>
1 parent 617dabb

File tree

1 file changed: +28 -0 lines changed

docs/source/user_guide/sleep_mode.md

Lines changed: 28 additions & 0 deletions
@@ -8,6 +8,8 @@ Sleep Mode is the API which can selectively exposed to offload weight, discard k
 
 This module provides a custom memory allocator for Ascend NPUs using the [CANN](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) runtime. It integrates tightly with PyTorch via `torch.npu.memory.NPUPluggableAllocator` and supports a "sleep mode", which allows tensors to offload memory to the CPU and release NPU memory when it's no longer immediately needed. This improves memory efficiency and allows large-scale inference to run in constrained environments.
 
+With `enable_sleep_mode=True`, memory management (malloc/free) in vLLM runs under the `use_memory_pool` context manager, and every allocation created inside the context is served from the memory pool and carries the specified tag.
+
 ```bash
 +-------------------+        +---------------------------+      +----------------------------+
 | Python Layer      | -----> | CaMemAllocator (class)    | ---> | C Extension (vllm_ascend_C)|
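
Note on the `use_memory_pool` context manager mentioned in the hunk above: vLLM's CUDA-side `CuMemAllocator` exposes it as a tag-scoped context, and the sketch below assumes the Ascend `CaMemAllocator` mirrors that interface (`get_instance()`, `use_memory_pool(tag=...)`, `sleep()`, `wake_up()`). The import path and tensor shape are illustrative assumptions, not taken from this commit.

```python
import torch
import torch_npu  # noqa: F401  # registers the "npu" device with PyTorch

# Assumed import path; the actual location in vllm-ascend may differ.
from vllm_ascend.device_allocator.camem import CaMemAllocator

allocator = CaMemAllocator.get_instance()

# Allocations made inside the context go through the pluggable allocator's
# memory pool and carry the given tag, so sleep() can decide per tag whether
# to offload the backing memory to CPU or simply discard it.
with allocator.use_memory_pool(tag="weights"):
    weights = torch.empty(1024, 1024, device="npu")

# Offload "weights"-tagged allocations to CPU and free the NPU memory.
allocator.sleep(offload_tags=("weights",))

# Re-allocate NPU memory and copy the offloaded data back.
allocator.wake_up()
```

Tagging is what lets sleep mode offload weight buffers while simply dropping KV-cache buffers, matching the behavior described at the top of this doc.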
@@ -22,6 +24,7 @@ Since this feature uses the AscendCL API, in order to use sleep mode, you should
 
 Let's take the default parameters of v1 engine as an example
 
+- For offline inference:
 ```python
 import os
 
@@ -59,3 +62,28 @@ if __name__ == "__main__":
     # cmp output
     assert output[0].outputs[0].text == output2[0].outputs[0].text
 ```
+
+- For online serving:
+Considering there may be a risk of malicious access, make sure you are running in development mode and explicitly set the environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
+
+Server command:
+```bash
+export VLLM_SERVER_DEV_MODE="1"
+export VLLM_USE_V1="1"
+export VLLM_WORKER_MULTIPROC_METHOD="spawn"
+export VLLM_USE_MODELSCOPE="True"
+
+vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode
+
+# after serving is up, POST to these endpoints
+
+# 1. sleep
+curl -X POST http://127.0.0.1:8000/sleep
+
+curl -X POST http://127.0.0.1:8000/is_sleeping
+
+# 2. wake up
+curl -X POST http://127.0.0.1:8000/wake_up
+curl -X POST http://127.0.0.1:8000/is_sleeping
+```
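
For context, the offline-inference example extended by this commit is mostly elided in the hunks above; the end-to-end flow follows the usual vLLM sleep-mode pattern, roughly as sketched below. The prompt, sampling parameters, and sleep level here are illustrative assumptions, not necessarily what the doc uses.

```python
import os

from vllm import LLM, SamplingParams

# Sleep mode on Ascend expects the v1 engine and the spawn worker method
# (the same env vars set in the serving example above).
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    prompt = "How are you?"  # illustrative prompt
    sampling_params = SamplingParams(temperature=0, max_tokens=10)
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)

    output = llm.generate(prompt, sampling_params)

    # Level-1 sleep offloads model weights to CPU and discards the KV cache.
    llm.sleep(level=1)
    # wake_up() reloads the weights onto the NPU.
    llm.wake_up()

    output2 = llm.generate(prompt, sampling_params)

    # cmp output: generation after wake_up() should be identical.
    assert output[0].outputs[0].text == output2[0].outputs[0].text
```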
