Fixing GPU Adapter Count test to be more dynamic and failure-resistant #4038
base: main
umfranci
commented
Oct 10, 2025
- The verify_gpu_adapter_count test validates the GPU count by comparing the output of the lsvmbus, lspci, and nvidia-smi commands. However, it relies on a hardcoded list of GPU models and their device IDs to identify GPUs in the lsvmbus output.
- This hardcoded approach fails on new GPU models, so the code has to be updated manually each time new GPU hardware is released. This delays testing, adds maintenance overhead, and raises the test's failure rate.
- The aim here is to detect GPUs dynamically so that new models are identified without manual intervention, while staying backward compatible with the existing GPU detection logic.
- Suggested fix:
  - Primary detection: continue using the existing hardcoded GPU list for known models.
  - Fallback mechanism: when no matches are found in the hardcoded list:
    - Group VMBus devices by their last segment (device ID suffix).
    - Identify GPU device groups where all entries are marked as "PCI Express pass-through".
    - Validate that the count matches the nvidia-smi output.
  - Direct counting: added a new function that gets the GPU count directly from the nvidia-smi command output, removing the dependency on a maintained hardcoded GPU model list.
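For readers following the description, the fallback and the direct count could look roughly like the sketch below. This is not the code in this PR: `VmbusDevice`, `count_gpus_from_nvidia_smi`, and `count_gpus_by_last_segment` are hypothetical names, the devices are assumed to be already parsed from lsvmbus into name/ID pairs, and the nvidia-smi count is assumed to come from the `nvidia-smi -L` listing (one `GPU <index>:` line per GPU).

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

# Label that lsvmbus prints for passed-through PCI devices.
PASS_THROUGH = "PCI Express pass-through"

# Matches one "GPU <index>:" line of `nvidia-smi -L` output.
_GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


class VmbusDevice(NamedTuple):
    """Hypothetical record for one parsed lsvmbus entry."""

    name: str
    device_id: str


def count_gpus_from_nvidia_smi(list_output: str) -> int:
    """Count GPUs in the text printed by `nvidia-smi -L`."""
    return len(_GPU_LINE.findall(list_output))


def count_gpus_by_last_segment(
    devices: Iterable[VmbusDevice], expected_count: int
) -> int:
    """Return the size of a pass-through group that matches expected_count.

    Devices are grouped by the last hyphen-separated segment of their
    device ID. A group is a GPU candidate only if every member is a
    pass-through device. Returns 0 when no candidate group matches.
    """
    groups: Dict[str, List[VmbusDevice]] = defaultdict(list)
    for device in devices:
        suffix = device.device_id.strip("{}").rsplit("-", 1)[-1]
        groups[suffix].append(device)

    for members in groups.values():
        all_pass_through = all(PASS_THROUGH in m.name for m in members)
        if all_pass_through and len(members) == expected_count:
            return len(members)
    return 0
```

In this sketch the nvidia-smi count is passed in as `expected_count`, so a group is only accepted when both sources agree. How the PR actually combines the two checks may differ.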
lisa/features/gpu.py
Outdated
# If no matches in hardcoded list, group by last segment
self._log.debug("No GPUs found in hardcoded list, trying last-segment grouping")
gpu_count = self._get_gpu_count_by_last_segment(vmbus_devices)
Why not get the count directly with this method? And why can it find more GPUs than the method above?
The known-list approach seems to be the more deterministic path for known SKUs. The grouping fallback only activates when the primary lookup yields zero, so it won't over-count or regress existing coverage. Grouping by last segment lets us automatically recognize newly released GPUs that share a common encoded suffix, without waiting for a manual list update.
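To illustrate the ordering described here, a minimal sketch of the primary-then-fallback flow follows. `resolve_gpu_count`, its parameters, and the substring match against known IDs are all hypothetical, chosen only to show that the fallback runs when the known list yields zero. They are not taken from lisa/features/gpu.py.

```python
from typing import Callable, Iterable, Sequence


def resolve_gpu_count(
    device_ids: Sequence[str],
    known_gpu_ids: Iterable[str],
    fallback: Callable[[Sequence[str]], int],
) -> int:
    """Prefer the known-list count; use the fallback only when it is zero."""
    known = tuple(known_gpu_ids)

    # Primary path: count devices whose ID contains a known GPU identifier.
    primary = sum(
        1 for device_id in device_ids if any(k in device_id for k in known)
    )
    if primary > 0:
        # Known SKUs never reach the fallback, so existing results are unchanged.
        return primary

    # Fallback path: only reached when the known list matched nothing.
    return fallback(device_ids)
```

Because the fallback is passed in as a callable, the grouping logic can be swapped or tested separately from the known-list lookup.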