Conversation

@gmagogsfm
Contributor

@gmagogsfm commented Jan 23, 2026

THIS IS A MANUALLY STACKED PR, PLEASE ONLY REVIEW TOP COMMIT, lower commits are being reviewed separately in an earlier PR.

This PR adds two basic components to help Helion kernels with compilation and runtime dispatching.

  • HelionKernelWrapper will be constructed by vllm.helion.register() in following PRs. It is responsible for adding the Helion kernel to the registry and partially specializing it based on the GPU platform and model config. The result of this specialization is a ConfiguredHelionKernel, a callable registered as a PyTorch custom op.

  • The registered custom op contains the batch-size-based runtime dispatch as well as the actual Helion compilation logic. Upon invocation with dummy or real input data, it compiles and calls the Helion kernel optimized for the best-fitting batch size. This dispatch decision can then be baked in via CUDA graph capture to completely eliminate the dispatch overhead (a rough sketch of this flow follows below).
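
A rough sketch of how these two pieces could fit together (illustrative only; the dispatch heuristic, the configs_by_batch_size layout, and method names such as configure/_compile_for are assumptions, not the PR's exact implementation):

# Illustrative sketch, not the PR's code: shows the intended wrapper -> custom-op flow.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ConfiguredHelionKernel:
    """Callable that would be registered as a PyTorch custom op."""

    raw_kernel_func: Callable
    configs_by_batch_size: dict[int, object]  # batch-size bucket -> Helion config
    _compiled: dict[int, Callable] = field(default_factory=dict)

    def __call__(self, x, *args, **kwargs):
        # Runtime dispatch: pick the smallest bucket that still fits the batch size.
        batch_size = x.shape[0]
        fitting = [b for b in self.configs_by_batch_size if b >= batch_size]
        bucket = min(fitting) if fitting else max(self.configs_by_batch_size)
        if bucket not in self._compiled:
            # Lazy Helion compilation; once captured in a CUDA graph, the chosen
            # branch is baked in, so dispatch costs nothing at replay time.
            self._compiled[bucket] = self._compile_for(bucket)
        return self._compiled[bucket](x, *args, **kwargs)

    def _compile_for(self, bucket: int) -> Callable:
        # Placeholder for the real helion.kernel(...) compilation path.
        return self.raw_kernel_func


class HelionKernelWrapper:
    """Built by vllm.helion.register(); specializes a kernel for platform/model config."""

    def __init__(self, raw_kernel_func: Callable):
        self.raw_kernel_func = raw_kernel_func

    def configure(self, configs_by_batch_size: dict[int, object]) -> ConfiguredHelionKernel:
        return ConfiguredHelionKernel(self.raw_kernel_func, configs_by_batch_size)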

Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Contributor

@gemini-code-assist bot left a comment


Code Review

The pull request introduces the Helion kernel wrapper, configuration management, and utility functions. The new components, ConfigKey, ConfigSet, ConfigManager, ConfiguredHelionKernel, and HelionKernelWrapper, are well-structured and include comprehensive unit tests covering various scenarios, including valid and invalid inputs, default values, and error handling. The design for dynamic batch-size-based kernel dispatching and compilation caching is clear. The __all__ declarations in the __init__.py files are correctly populated, and the ImportError checks for the helion dependency are properly implemented.
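
For context, the optional-dependency guard the review refers to typically looks something like the following (a sketch of the pattern only, not the literal __init__.py contents; the HAS_HELION flag name is illustrative):

# Sketch of a guarded optional import plus explicit __all__; names illustrative.
try:
    import helion  # noqa: F401
    HAS_HELION = True
except ImportError:
    HAS_HELION = False

__all__ = [
    "ConfigKey",
    "ConfigSet",
    "ConfigManager",
    "ConfiguredHelionKernel",
    "HelionKernelWrapper",
]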

@gmagogsfm
Contributor Author

@ProExpertProg @zou3519 @xiaohongchen1991 Please take a look; this PR depends on #32740.

@xiaohongchen1991
Contributor

For lines 154-155 inside ConfiguredHelionKernel._get_compiled_kernel:

compiled_kernel = helion.kernel(**kernel_kwargs)(self.wrapper.raw_kernel_func)
self._compiled_kernels[config_hash] = compiled_kernel

This actually caches the decorated kernel, i.e., the Helion kernel decorated with the "best" config found.

I ran some experiments earlier on the CPU overhead, comparing invocations of the silu_and_mul kernel at different compilation stages. See the following results.

Kernel        decorated_silu_mul   bound_silu_mul   compiled_silu_mul   code_gen_triton_kernel
Latency (ms)  0.050                0.044            0.038               0.033

Mapping to your code, those kernels come from:

decorated_silu_mul = helion.kernel(**kernel_kwargs)(silu_mul)  # decorated kernel
bound_silu_mul = decorated_silu_mul.bind(args)                 # bound to example inputs (args as a tuple)
compiled_silu_mul = bound_silu_mul.compile_config(config)      # compiled for a specific config
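
For reference, a minimal way to reproduce such a comparison (a sketch assuming the silu_mul, args, and config objects from the snippet above are defined; the measure helper is illustrative):

import time

import torch


def measure(fn, *call_args, iters: int = 100) -> float:
    """Average wall-clock latency per call in milliseconds (includes CPU launch overhead)."""
    fn(*call_args)  # warm-up / trigger any lazy compilation
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*call_args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3


for name, fn in [
    ("decorated_silu_mul", decorated_silu_mul),
    ("bound_silu_mul", bound_silu_mul),
    ("compiled_silu_mul", compiled_silu_mul),
]:
    print(f"{name}: {measure(fn, *args):.3f} ms")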

I understand this CPU overhead is negligible when graph capture is enabled, but it may still be worth optimizing, since ConfiguredHelionKernel._get_compiled_kernel is called from ConfiguredHelionKernel.__call__, where we already have the inputs needed to create the real compiled_kernel and cache it directly.
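
A sketch of that suggestion (hypothetical; _select_config is an assumed helper for the batch-size-based selection, and kernel_kwargs/config follow the naming in the snippets above):

def __call__(self, *args, **kwargs):
    """Execute the kernel, caching the fully compiled kernel per selected config."""
    config, config_hash = self._select_config(args)  # assumed batch-size-based selection helper
    compiled = self._compiled_kernels.get(config_hash)
    if compiled is None:
        # Bind to the real inputs and compile for the chosen config up front, so
        # later calls skip the decorated-kernel and bind stages measured above.
        decorated = helion.kernel(**kernel_kwargs)(self.wrapper.raw_kernel_func)
        compiled = decorated.bind(args).compile_config(config)
        self._compiled_kernels[config_hash] = compiled
    return compiled(*args)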


def __call__(self, *args, **kwargs):
    """Execute the kernel with dynamic batch_size-based config selection."""
    # TODO(gmagogsfm): Validate this assumption. If it doesn't hold
Contributor


Yeah, this assumption will not hold for all kernels. Here is an example from an existing Triton kernel used by the LoRA feature:

inputs: torch.Tensor,  # shape [num_slices, num_tokens, lora_rank]

We need a more generic solution here.
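
One possible direction (purely illustrative, not something the PR implements): let registration declare which argument and dimension carry the token/batch count instead of assuming args[0].shape[0]:

from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class BatchDimSpec:
    """Hypothetical registration-time hint: where the token/batch count lives."""
    arg_index: int = 0
    dim: int = 0


def infer_batch_size(args: tuple[torch.Tensor, ...], spec: BatchDimSpec) -> int:
    # For the LoRA kernel above, inputs has shape [num_slices, num_tokens, lora_rank],
    # so the right hint would be BatchDimSpec(arg_index=0, dim=1).
    return args[spec.arg_index].shape[spec.dim]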
