Commit 8ff9b88

fix doc
Signed-off-by: wangli <[email protected]>
1 parent 106530b commit 8ff9b88

1 file changed: +5 −11 lines changed

docs/source/user_guide/sleep_mode.md

## Overview

Sleep Mode is an API designed to offload model weights and discard the KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or online DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.

Since the generation and training phases may employ different model parallelism strategies, it is crucial to free the KV cache and even offload the model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.
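
The alternation described above can be sketched as a driver loop. The `sleep()` / `wake_up()` names follow vLLM's sleep-mode API; `trainer` and the loop itself are hypothetical stand-ins for a real RL training framework, not code from this project:

```python
# Hypothetical RL post-training loop: roll out with the inference engine,
# then free its NPU memory before the training step. `llm` is assumed to
# expose vLLM-style generate(), sleep(), and wake_up(); `trainer` stands in
# for the training framework that runs the fwd + bwd passes.
def rl_post_training_loop(llm, trainer, prompts, steps=1):
    for _ in range(steps):
        rollouts = llm.generate(prompts)  # auto-regressive generation
        llm.sleep(level=1)                # discard KV cache, offload weights
        trainer.step(rollouts)            # fwd + bwd, possibly under a
                                          # different parallel strategy
        llm.wake_up()                     # reload weights onto the NPU
```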

## Getting started

With `enable_sleep_mode=True`, memory management (malloc, free) in vLLM is placed under a dedicated memory pool. While loading model weights and initializing the KV caches, each allocation is tagged, yielding a map of the form `{"weight": data, "kv_cache": data}`.
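
A minimal, framework-free sketch of this tagging idea (all names here are illustrative, not the real allocator API): allocations made while a tag is active are recorded under that tag, so a later `sleep()` can offload `weight` buffers to host memory while simply discarding `kv_cache` buffers:

```python
from contextlib import contextmanager

class ToyTaggedPool:
    """Illustrative stand-in for a tagged NPU memory pool."""
    def __init__(self):
        self.device = {}   # tag -> live "device" buffers
        self.host = {}     # tag -> buffers offloaded to CPU
        self._tag = None

    @contextmanager
    def use_memory_pool(self, tag):
        # All allocations inside this context get the given tag.
        self._tag = tag
        try:
            yield
        finally:
            self._tag = None

    def malloc(self, data):
        self.device.setdefault(self._tag, []).append(data)

    def sleep(self):
        # Offload weights to host memory; the KV cache is simply discarded.
        self.host["weight"] = self.device.pop("weight", [])
        self.device.pop("kv_cache", None)

    def wake_up(self):
        # Restore weights to the device; KV cache must be re-allocated.
        self.device["weight"] = self.host.pop("weight", [])

pool = ToyTaggedPool()
with pool.use_memory_pool("weight"):
    pool.malloc("layer0.weight")
with pool.use_memory_pool("kv_cache"):
    pool.malloc("block0")
pool.sleep()    # device holds nothing; weights live on the host
pool.wake_up()  # weights are back on the device
```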
Since this feature uses the AscendCL API, to use sleep mode you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.
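
A sketch of the build steps, assuming the source layout described in the linked installation guide (the clone URL is illustrative; consult the guide for the authoritative commands and prerequisites):

```shell
# Build vllm-ascend from source so the AscendCL-backed allocator is compiled.
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
# Needed explicitly only on v0.7.3; v0.9.x+ sets this by default.
export COMPILE_CUSTOM_KERNELS=1
pip install -v -e .
```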
