Fixing GPU Adapter Count test to be more dynamic and fail resistant #4038

umfranci · 2025-10-10T06:17:19Z

The verify_gpu_adapter_count test validates GPU counts by comparing outputs from lsvmbus, lspci, and nvidia-smi commands. However, it relies on a hardcoded list of GPU models and their device IDs to identify GPUs in the lsvmbus output.
This hardcoded approach fails when testing new GPU models, requiring a manual code update each time new GPU hardware is released. This causes testing delays, adds maintenance overhead, and raises the test's failure rate.
This change implements dynamic GPU detection so that new GPU models are identified without manual intervention, while remaining backward compatible with the existing GPU detection logic.
Suggested Fix:
- Primary detection: Continue using the existing hardcoded GPU list for known models
- Fallback mechanism: When no matches are found in the hardcoded list:
  - Group VMBus devices by their last segment (device ID suffix)
  - Identify GPU device groups where all entries are marked as "PCI Express pass-through"
  - Validate that the group's count matches the nvidia-smi output
- Direct counting: Add a new function that gets the GPU count directly from nvidia-smi command output, removing the dependency on a hardcoded GPU model list
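
The fallback and direct-counting steps above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `VmbusDevice`, `count_gpus_by_last_segment`, and `count_gpus_from_nvidia_smi_list` are hypothetical names, the device ID is assumed to be a hyphen-separated GUID string, and the direct count is assumed to come from the text printed by `nvidia-smi -L` (one `GPU <n>: ...` line per GPU).

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Device name used for passed-through PCI devices, as quoted in the description.
PASSTHROUGH_NAME = "PCI Express pass-through"


@dataclass
class VmbusDevice:
    """Hypothetical, simplified view of one parsed lsvmbus entry."""

    name: str
    device_id: str


def count_gpus_by_last_segment(
    devices: List[VmbusDevice], expected_count: int
) -> int:
    """Group devices by the last segment of their device ID and return the
    size of a group that is entirely pass-through and matches the count
    reported by nvidia-smi. Returns 0 when no group qualifies."""
    groups: Dict[str, List[VmbusDevice]] = defaultdict(list)
    for device in devices:
        # Assumption: the ID looks like "{aaaa-bbbb-...-suffix}".
        suffix = device.device_id.strip("{}").rsplit("-", 1)[-1].lower()
        groups[suffix].append(device)

    for members in groups.values():
        all_passthrough = all(PASSTHROUGH_NAME in m.name for m in members)
        if all_passthrough and len(members) == expected_count:
            return len(members)
    return 0


# Top-level lines of `nvidia-smi -L` start with "GPU <index>:".
_GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


def count_gpus_from_nvidia_smi_list(output: str) -> int:
    """Count GPUs from `nvidia-smi -L` text without a model list."""
    return len(_GPU_LINE.findall(output))
```

Requiring the group size to equal the nvidia-smi count is what keeps the grouping from mistaking other pass-through devices (for example, NICs that share a suffix) for GPUs.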

squirrelsc · 2025-10-10T20:37:51Z

lisa/features/gpu.py

+
+        # If no matches in hardcoded list, group by last segment
+        self._log.debug("No GPUs found in hardcoded list, trying last-segment grouping")
+        gpu_count = self._get_gpu_count_by_last_segment(vmbus_devices)


Why not get the count directly with this method? Why can it find more than the method above?

The known-list approach is the more deterministic path for known SKUs. The grouping fallback only activates when the primary lookup yields zero matches, so it won't over-count or regress existing coverage. Grouping by last segment lets us automatically recognize newly released GPUs that share a common encoded suffix, without waiting for a manual list update.
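
The ordering described in this reply, known list first and fallback only on zero matches, can be sketched as below. The function and parameter names are hypothetical and do not come from `lisa/features/gpu.py`; the fallback is passed in as a callable so that it is only evaluated when the known list finds nothing.

```python
from typing import Callable, Iterable, Sequence


def count_gpus_two_tier(
    device_ids: Sequence[str],
    known_gpu_ids: Iterable[str],
    fallback: Callable[[], int],
) -> int:
    """Count devices whose ID is in the known list; only if that count is
    zero, defer to the fallback (for example, last-segment grouping)."""
    known = {gpu_id.lower() for gpu_id in known_gpu_ids}
    matches = sum(1 for device_id in device_ids if device_id.lower() in known)
    if matches > 0:
        # Known SKUs keep the deterministic path; the fallback never runs.
        return matches
    return fallback()
```

Because the fallback is skipped whenever the list matches at least one device, results for SKUs already in the list stay exactly as they were before the change.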

umfranci added 6 commits October 7, 2025 18:06

Adding code to get GPU Count dynamically

01b3d6c

determining GPU count based on grouping if initial list does not cont…

57121bd

…ain the GPU

removing GB200 from known device list

1750183

adding new function to return raw nvidia-smi gpu count

c2aaf45

removing A10-4Q for testing

7ce447f

changes to fix flake8 and use nvidia-smi without predefined list

51351e7

squirrelsc reviewed Oct 10, 2025

lisa/features/gpu.py

squirrelsc reviewed Oct 10, 2025

lisa/tools/nvidiasmi.py

squirrelsc reviewed Oct 10, 2025

lisa/features/gpu.py

squirrelsc reviewed Oct 10, 2025

changes to address comments and provide better naming

69c3735


umfranci Oct 14, 2025
