Commit 5be9ad1

[FP8][Kernel] add envs.VLLM_USE_CUTLASS_MOE_FP8
The `VLLM_USE_CUTLASS_MOE_FP8` environment variable controls whether the CUTLASS kernel is used for MoE FP8. It is disabled by default, so the original execution path is unchanged unless the variable is set to `1` or `true` (case-insensitive).
Usage: $ VLLM_USE_CUTLASS_MOE_FP8=1 python3 -m vllm.entrypoints.openai.api_server ...
1 parent 3015d56 commit 5be9ad1

1 file changed, +6 -0 lines changed

vllm/envs.py

Lines changed: 6 additions & 0 deletions
@@ -109,6 +109,7 @@
     VLLM_TPU_BUCKET_PADDING_GAP: int = 0
     VLLM_USE_DEEP_GEMM: bool = False
     VLLM_XGRAMMAR_CACHE_MB: int = 0
+    VLLM_USE_CUTLASS_MOE_FP8: bool = False
     VLLM_MSGPACK_ZERO_COPY_THRESHOLD: int = 256
 
 
@@ -718,6 +719,11 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     "VLLM_XGRAMMAR_CACHE_MB":
     lambda: int(os.getenv("VLLM_XGRAMMAR_CACHE_MB", "512")),
 
+    # Flag to control if vllm should use CUTLASS kernel for MoE FP8
+    "VLLM_USE_CUTLASS_MOE_FP8":
+    lambda: (os.environ.get("VLLM_USE_CUTLASS_MOE_FP8", "False").lower() in
+             ("true", "1")),
+
     # Control the threshold for msgspec to use 'zero copy' for
     # serialization/deserialization of tensors. Tensors below
     # this limit will be encoded into the msgpack buffer, and
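
The parsing rule in the added lambda can be checked in isolation. The sketch below is not part of the commit: `use_cutlass_moe_fp8` and its `environ` parameter are hypothetical names introduced here for illustration, and only the expression inside the function mirrors the diff. It shows which values enable the flag under that rule.

```python
import os
from typing import Mapping, Optional

FLAG = "VLLM_USE_CUTLASS_MOE_FP8"


def use_cutlass_moe_fp8(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: applies the same rule as the lambda in the diff.

    `environ` is an assumption added so the rule can be exercised without
    touching the real process environment; it defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    # Same expression as the committed lambda: unset defaults to "False",
    # and only "true" or "1" (case-insensitive) enable the flag.
    return env.get(FLAG, "False").lower() in ("true", "1")


if __name__ == "__main__":
    for value in (None, "1", "true", "TRUE", "0", "yes", ""):
        env = {} if value is None else {FLAG: value}
        print(repr(value), "->", use_cutlass_moe_fp8(env))
```

Under this rule, values such as `yes` or `on` do not enable the flag, and surrounding whitespace is not stripped, so `VLLM_USE_CUTLASS_MOE_FP8=1` is the safest spelling.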
