## Overview

Sleep Mode is an API designed to offload model weights and discard KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or online DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.

Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free the KV cache and even offload the model parameters stored within vLLM during the training stage. This ensures efficient memory utilization and avoids resource contention on the NPU.
## Getting started

With `enable_sleep_mode=True`, memory management (malloc, free) in vLLM is placed under a dedicated memory pool. While loading the model weights and initializing the KV caches, each allocation is tagged by category, as in the map `{"weight": data, "kv_cache": data}`, so that weights and KV-cache blocks can later be offloaded or freed selectively.
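
As a minimal sketch of how this is used in practice (the model name and prompt are arbitrary examples; `sleep()` and `wake_up()` are vLLM's offline-inference entry points for this feature):

```python
from vllm import LLM, SamplingParams

# Example model; any model supported by vllm-ascend works the same way.
llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)

# Rollout phase: generate as usual while the engine is awake.
llm.generate("How are you?", SamplingParams(temperature=0, max_tokens=10))

# Level 1 sleep offloads the "weight"-tagged memory to CPU and discards
# the "kv_cache"-tagged memory; level 2 discards the weights as well.
llm.sleep(level=1)

# ... run the training phase here, using the freed NPU memory ...

# Bring the weights back onto the NPU and re-allocate the KV cache
# before the next rollout.
llm.wake_up()
```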
Since this feature uses the AscendCL API, to use sleep mode you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build vllm-ascend from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to `1` by default when building from source.
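
Once built, a quick way to confirm that sleeping actually releases device memory is to compare free NPU memory around the `sleep()` call. The sketch below assumes a torch_npu build that exposes `torch.npu.mem_get_info()` (the NPU counterpart of `torch.cuda.mem_get_info()`) and reuses the example model from above:

```python
import torch
from vllm import LLM

if __name__ == "__main__":
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    llm.generate("How are you?")

    free_before, _ = torch.npu.mem_get_info()
    llm.sleep(level=1)
    free_after, _ = torch.npu.mem_get_info()

    # Weights were offloaded to CPU and the KV cache discarded, so free
    # NPU memory should grow by roughly their combined footprint.
    print(f"Freed {(free_after - free_before) / 1024**3:.2f} GiB by sleeping")
```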