We need to improve the performance of the MoE GEMM kernels, including the FP8 case: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/fused_batched_moe.py#L356
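
For anyone tuning these kernels, a slow but obviously correct reference can help check that an optimized version still produces the same numbers. The following is a minimal pure-Python sketch of a batched per-expert GEMM, not the vLLM implementation. The function name `batched_moe_gemm_reference`, the layouts (`a` as `[E][max_tokens][K]`, `b` as `[E][N][K]`), the `expert_num_tokens` argument, and the per-tensor `a_scale` / `b_scale` used to mimic FP8 dequantization are all assumptions made for illustration; the real kernel may use different layouts and scale granularities.

```python
from typing import List

Tensor3 = List[List[List[float]]]


def batched_moe_gemm_reference(
    a: Tensor3,                    # assumed layout: [E][max_tokens][K]
    b: Tensor3,                    # assumed layout: [E][N][K]
    expert_num_tokens: List[int],  # valid (non-padded) rows per expert
    a_scale: float = 1.0,          # hypothetical per-tensor activation scale
    b_scale: float = 1.0,          # hypothetical per-tensor weight scale
) -> Tensor3:
    """Return out[e][m][n] = a_scale * b_scale * dot(a[e][m], b[e][n]).

    Rows at or beyond expert_num_tokens[e] are treated as padding and
    left as zeros.
    """
    out: Tensor3 = []
    for e, (a_e, b_e) in enumerate(zip(a, b)):
        valid = expert_num_tokens[e]
        out_e = []
        for m, row in enumerate(a_e):
            if m >= valid:
                # Padded row: no token was routed here.
                out_e.append([0.0] * len(b_e))
                continue
            out_e.append(
                [a_scale * b_scale * sum(x * w for x, w in zip(row, col))
                 for col in b_e]
            )
        out.append(out_e)
    return out


# Two experts, up to two tokens each, K = N = 2.
a = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 0.0]]]
b = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 3.0], [4.0, 5.0]]]
result = batched_moe_gemm_reference(a, b, expert_num_tokens=[2, 1])
```

Comparing an optimized kernel against a reference like this on small shapes, with and without scales, is one way to confirm that a performance change has not altered results.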