RuntimeError: Cuda error: 9[cudaLaunchKernel(func_tbl[func_idx], gridSize, blockSize, args, 0, stream);]