[Kernel] [Helion] Helion kernel wrapper #32964
Conversation
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Code Review
The pull request introduces the Helion kernel wrapper, configuration management, and utility functions. The new components, ConfigKey, ConfigSet, ConfigManager, ConfiguredHelionKernel, and HelionKernelWrapper, are well-structured and include comprehensive unit tests covering various scenarios, including valid and invalid inputs, default values, and error handling. The design for dynamic batch-size-based kernel dispatching and compilation caching is clear. The __all__ declarations in the __init__.py files are correctly populated, and the ImportError checks for the helion dependency are properly implemented.
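For readers who have not opened the diff, here is a rough sketch of how these pieces might relate. The field names and structure below are guesses for illustration only, not the PR's actual definitions:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConfigKey:
    # Illustrative: the PR dispatches on batch size, so the key
    # presumably carries at least that.
    batch_size: int


@dataclass
class ConfigSet:
    # Illustrative mapping from key to Helion config kwargs.
    configs: dict[ConfigKey, dict] = field(default_factory=dict)


class ConfigManager:
    """Looks up the best-known config for a given key (sketch only)."""

    def __init__(self, config_set: ConfigSet):
        self._config_set = config_set

    def lookup(self, key: ConfigKey) -> dict:
        return self._config_set.configs[key]
```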
@ProExpertProg @zou3519 @xiaohongchen1991 Please take a look; this PR depends on #32740.
For lines 154-155: this is actually to cache the decorated kernel, i.e., the Helion kernel decorated with the "best" config found. I did some experiments earlier on the CPU overhead by comparing invocations of the silu_and_mul kernel at different compilation stages. See the following results.
Mapping to your code, those kernels are from … I understand this CPU overhead is negligible when graph capture is enabled, but it may still be good to optimize it, since …
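To make the suggestion concrete, a minimal sketch of caching the decorated kernel. All names here (`_kernel_cache`, `decorate`, `get_decorated_kernel`) are illustrative stand-ins, not the actual vLLM/Helion API:

```python
from typing import Any, Callable

# Cache of already-decorated kernels, keyed by batch size. `decorate` stands
# in for the expensive Helion decoration/compilation step.
_kernel_cache: dict[int, Callable[..., Any]] = {}


def get_decorated_kernel(
    batch_size: int,
    decorate: Callable[[int], Callable[..., Any]],
) -> Callable[..., Any]:
    """Return the kernel decorated with the best config, decorating at most once."""
    kernel = _kernel_cache.get(batch_size)
    if kernel is None:
        # Expensive path: runs once per batch size; every later call for the
        # same batch size hits only the dict lookup.
        kernel = decorate(batch_size)
        _kernel_cache[batch_size] = kernel
    return kernel
```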
```python
def __call__(self, *args, **kwargs):
    """Execute the kernel with dynamic batch_size-based config selection."""
    # TODO(gmagogsfm): Validate this assumption. If it doesn't hold
```
Yeah, this assumption will not hold for all kernels. Here is an example from an existing Triton kernel used by the LoRA feature:
```python
inputs: torch.Tensor,  # shape [num_slices, num_tokens, lora_rank]
```
We need a more generic solution here.
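One possible direction, sketched with made-up names (`dispatch_key`, `dispatch_arg`, and `dispatch_dim` are not part of this PR): let each kernel registration declare which argument and which dimension carry the dispatch size, instead of hard-coding dim 0 of the first argument:

```python
import torch


def dispatch_key(args: tuple, dispatch_arg: int = 0, dispatch_dim: int = 0) -> int:
    """Read the dispatch size from a declared (argument, dimension) pair.

    Hypothetical helper; the defaults reproduce the current batch_size
    assumption (dim 0 of the first argument).
    """
    tensor = args[dispatch_arg]
    assert isinstance(tensor, torch.Tensor)
    return tensor.shape[dispatch_dim]


# For the LoRA kernel above, the token count lives in dim 1 of `inputs`,
# whose shape is [num_slices, num_tokens, lora_rank]:
inputs = torch.empty(2, 128, 16)
num_tokens = dispatch_key((inputs,), dispatch_arg=0, dispatch_dim=1)  # 128
```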
THIS IS A MANUALLY STACKED PR. PLEASE ONLY REVIEW THE TOP COMMIT; lower commits are being reviewed separately in an earlier PR.
This PR adds two basic building blocks to help Helion kernels with compilation and runtime dispatching.
HelionKernelWrapper will be constructed by vllm.helion.register() in follow-up PRs. It is responsible for adding a Helion kernel to the registry and partially specializing it according to the GPU platform and model config. As a result of that specialization, it produces a ConfiguredHelionKernel, which is a callable registered as a PyTorch custom op.
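As a rough illustration of that flow (the class body and the `vllm_helion` op namespace below are guesses, assuming torch >= 2.4 for `torch.library.custom_op`; the real wiring goes through vllm.helion.register() and this PR's registry):

```python
import torch


class HelionKernelWrapper:
    """Illustrative sketch only, not the PR's actual implementation."""

    def __init__(self, kernel, name: str):
        self.kernel = kernel
        self.name = name

    def configure(self, gpu_platform: str, model_config: dict):
        """Partially specialize the kernel, then expose it as a custom op."""

        def configured(x: torch.Tensor) -> torch.Tensor:
            # Real code would pick a Helion config based on (gpu_platform,
            # model_config) and dispatch on batch size before calling the
            # kernel; the kernel must return a fresh tensor for custom ops.
            return self.kernel(x)

        # Register under a hypothetical namespace; returns a callable op.
        return torch.library.custom_op(
            f"vllm_helion::{self.name}", mutates_args=()
        )(configured)
```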
The registered custom op contains batch-size-based runtime dispatching as well as the actual Helion compilation logic. Upon invocation with dummy or real input data, it compiles and calls the Helion kernel optimized for the best-fitting batch size. This dispatch decision can then be baked in via CUDA graph capture to eliminate the dispatch overhead entirely.
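A minimal sketch of what that batch-size dispatch could look like; the bucket values and names are invented for illustration, not taken from this PR:

```python
import bisect

# Hypothetical bucket list; a real deployment would derive these from the
# configs found during tuning.
BUCKETS = [1, 8, 32, 128, 512]


def select_bucket(batch_size: int) -> int:
    """Round up to the smallest configured bucket that fits, falling back to
    the largest bucket when batch_size exceeds all of them."""
    idx = bisect.bisect_left(BUCKETS, batch_size)
    return BUCKETS[min(idx, len(BUCKETS) - 1)]


assert select_bucket(20) == 32
assert select_bucket(4096) == 512
```

Under CUDA graph capture, each captured batch size follows exactly one branch of this dispatch, so the selected kernel is baked into the graph and the Python-side lookup drops out of the hot path.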