docs/source/user_guide/sleep_mode.md (+48 -13: 48 additions & 13 deletions)
@@ -9,17 +9,30 @@ Since the generation and training phases may employ different model parallelism
## Getting started

-With `enable_sleep_mode=True`, the way we manage memory(malloc, free) in vllm will under the management of a specific memory pool, during loading model weight and initialize kv_caches, we tag the memory as a map: `{"weight": data, "kv_cache": data}`
+With `enable_sleep_mode=True`, memory management (malloc, free) in vLLM is placed under a dedicated memory pool. While loading the model weights and initializing the KV caches, the memory is tagged as a map: `{"weight": data, "kv_cache": data}`.

+The engine (v0/v1) supports two sleep levels to manage memory during idle periods:

-Since this feature uses the AscendCL API, in order to use sleep mode, you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and building from source, if you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`, for the latest version(v0.9.x+), the environment variable COMPILE_CUSTOM_KERNELS will be set 1 by default while building from source.
+- Level 1 Sleep
+    - Action: Offloads model weights and discards the KV cache.
+    - Memory: Model weights are moved to CPU memory; the KV cache is forgotten.
+    - Use Case: Suitable when reusing the same model later.
+    - Note: Ensure sufficient CPU memory is available to hold the model weights.
+
+- Level 2 Sleep
+    - Action: Discards both model weights and KV cache.
+    - Memory: The contents of both the model weights and the KV cache are forgotten.
+    - Use Case: Ideal when switching to a different model or updating the current one.
+
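The two sleep levels can be pictured with a toy model of the tagged map. This is only an illustrative sketch: `ToyTaggedPool` and its methods are hypothetical names, not part of vLLM or vllm-ascend, and plain Python dictionaries stand in for device and host memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaggedPool:
    """Hypothetical stand-in for the tagged memory pool; not vLLM code."""

    device: dict = field(default_factory=dict)      # tag -> data resident on the device
    cpu_backup: dict = field(default_factory=dict)  # tag -> data offloaded to the host

    def sleep(self, level: int = 1) -> None:
        if level not in (1, 2):
            raise ValueError("sleep level must be 1 or 2")
        if level == 1 and "weight" in self.device:
            # Level 1: keep a host copy so the same model can be resumed.
            self.cpu_backup["weight"] = self.device["weight"]
        # Both levels release device memory; the KV cache is never backed up.
        self.device.clear()

    def wake_up(self) -> None:
        # Restore whatever was offloaded; after level 2 nothing comes back.
        self.device.update(self.cpu_backup)
        self.cpu_backup.clear()


pool = ToyTaggedPool(device={"weight": b"w0", "kv_cache": b"kv"})

pool.sleep(level=1)
after_l1_sleep = dict(pool.device)  # device memory released
pool.wake_up()
after_l1_wake = dict(pool.device)   # weights restored, KV cache content gone

pool.sleep(level=2)
pool.wake_up()
after_l2_wake = dict(pool.device)   # nothing restored; weights must be reloaded
```

The toy tracks contents only. It does not model the re-allocation of device buffers on wake-up, which a real engine would need before serving requests again.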
+Since this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, you must follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source in order to use sleep mode. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`. For the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.

## Usage

-Let's take the default parameters of v1 engine as an example
+The following is a simple example of how to use sleep mode.
Because these endpoints could be abused by malicious callers, make sure you are running in dev mode, and explicitly set the development environment variable `VLLM_SERVER_DEV_MODE` to expose the sleep and wake-up endpoints.
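As a rough sketch of what calling these endpoints might look like from a client, the helpers below build the HTTP requests with the Python standard library. The paths `/sleep` and `/wake_up`, the `level` query parameter, and the address `http://localhost:8000` are assumptions based on the upstream vLLM server and are not confirmed by this page; check them against your deployment before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the server listens locally on port 8000 and was started
# with VLLM_SERVER_DEV_MODE set, so the dev-only endpoints are exposed.
BASE_URL = "http://localhost:8000"


def sleep_request(level: int = 1, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST request asking the engine to sleep."""
    # Assumed endpoint shape: POST /sleep?level=<1|2>
    return Request(f"{base_url}/sleep?{urlencode({'level': level})}", method="POST")


def wake_up_request(base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST request asking the engine to wake up."""
    # Assumed endpoint shape: POST /wake_up
    return Request(f"{base_url}/wake_up", method="POST")


def send(request: Request, timeout: float = 30.0) -> int:
    """Send a prepared request and return the HTTP status code."""
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

With a dev-mode server running, `send(sleep_request(level=1))` followed later by `send(wake_up_request())` would exercise one sleep and wake-up cycle. Building the requests separately from sending them keeps the URLs easy to inspect without a live server.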