Inquiry about the status of H100 config #514

@GaoHengSiang

Hello, thank you for this wonderful tool.

I was trying to simulate the cycle count of an H100 performing GEMM, but I only have access to an A40 GPU.
So I traced the GEMM benchmarks (tensor-core and regular) from Deepbench_nvidia on the A40, then tried to run the traces with the H100-SASS config.

I gathered the H100 configs from these posts:

#344
accel-sim/gpgpu-sim_distribution#80

The simulation ultimately failed with a segmentation fault (Status=SEGF), and I learned from #161 that this was to be expected.

The error log:

gemm_bench-inference_half_10_10_10_0_0--H100-SASS. Status=SEGF
Last 10 line of /home/gaohengsiang/accelsim/accel-sim-framework/util/job_launching/../../sim_run_12.0/gemm_bench/inference_half_10_10_10_0_0/H100-SASS/gemm_bench-inference_half_10_10_10_0_0.accelsim-commit-3c96d32_modified_0.0_25-11-25-22-09-15gpgpu-sim_git-commit-b18ee397_modified_0.0.o150
------------------
thread block = 0,0,1
GPGPU-Sim uArch: Shader 64 bind to kernel 7 '_ZN7cutlass6KernelI58cutlass_80_wmma_tensorop_h161616gemm_16x16_128x2_nn_align2EEvNT_6ParamsE'
thread block = 0,0,0
GPGPU-Sim: Reconfigure L1 cache to 96KB
GPGPU-Sim uArch: Shader 63 bind to kernel 7 '_ZN7cutlass6KernelI58cutlass_80_wmma_tensorop_h161616gemm_16x16_128x2_nn_align2EEvNT_6ParamsE'
launching kernel name: _ZN7cutlass6KernelI58cutlass_80_wmma_tensorop_h161616gemm_16x16_128x2_nn_align2EEvNT_6ParamsE uid: 7 cuda_stream_id: 0
Header info loaded for kernel command : ./traces/kernel-7-ctx_0x6245c0f86880.traceg.xz
-enable lineinfo = 0
-accelsim tracer version = 5
-nvbit version = 1.7.6
------------------

Contents of /home/gaohengsiang/accelsim/accel-sim-framework/util/job_launching/../../sim_run_12.0/gemm_bench/inference_half_10_10_10_0_0/H100-SASS/gemm_bench-inference_half_10_10_10_0_0.accelsim-commit-3c96d32_modified_0.0_25-11-25-22-09-15gpgpu-sim_git-commit-b18ee397_modified_0.0.e150
------------------
/home/gaohengsiang/accelsim/accel-sim-framework/util/job_launching/../../sim_run_12.0/gemm_bench/inference_half_10_10_10_0_0/H100-SASS/slurm.sim: line 54: 3984007 Segmentation fault (core dumped)
/home/gaohengsiang/accelsim/accel-sim-framework/util/job_launching/../../sim_run_12.0/gpgpu-sim-builds/accelsim-commit-3c96d32_modified_0.0_25-11-25-22-09-15gpgpu-sim_git-commit-b18ee397_modified_0.0/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g

All jobs appear to segfault at kernel-7, whose name refers to a wmma (tensor-core) CUTLASS kernel.
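
To check many job logs for the same pattern, the last launched kernel can be pulled out of the simulator output automatically. The following is a minimal sketch, not part of Accel-Sim: the helper name `last_launched_kernel` and the regular expression are assumptions, based only on the `launching kernel name: ... uid: ...` line format visible in the log above.

```python
import re

# Assumed line format, taken from the log excerpt in this issue:
#   launching kernel name: <mangled name> uid: <n> cuda_stream_id: <m>
LAUNCH_RE = re.compile(r"launching kernel name: (\S+) uid: (\d+)")


def last_launched_kernel(log_text):
    """Return (name, uid) of the last kernel launch found in log_text, or None."""
    matches = LAUNCH_RE.findall(log_text)
    if not matches:
        return None
    name, uid = matches[-1]
    return name, int(uid)


# Sample text copied from the log tail in this issue.
sample = (
    "GPGPU-Sim: Reconfigure L1 cache to 96KB\n"
    "launching kernel name: _ZN7cutlass6KernelI58cutlass_80_wmma_tensorop_"
    "h161616gemm_16x16_128x2_nn_align2EEvNT_6ParamsE uid: 7 cuda_stream_id: 0\n"
    "Header info loaded for kernel command : "
    "./traces/kernel-7-ctx_0x6245c0f86880.traceg.xz\n"
)

name, uid = last_launched_kernel(sample)
uses_wmma = "wmma" in name  # crude check on the mangled kernel name
print(uid, uses_wmma)
```

Running the same helper over each job's `.o` file would show whether every crash really follows a wmma kernel launch, or whether some jobs fail elsewhere.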

This led me to two questions:

  1. Since the h100-test branch was closed, is the H100 config usable at all, or is it currently unsalvageable?
  2. Is there a way to approximate H100 performance by modifying the A100 config? (The simulation with the A100 config ran fine.)
