umfranci
Collaborator

  • The verify_gpu_adapter_count test validates the GPU count by comparing the output of the lsvmbus, lspci, and nvidia-smi commands. However, it relies on a hardcoded list of GPU models and their device IDs to identify GPUs in the lsvmbus output.
  • This hardcoded approach fails on new GPU models, so the code must be updated manually each time new GPU hardware is released. That delays testing, adds maintenance overhead, and raises the test's failure rate.
  • The aim here is to implement dynamic GPU detection that identifies new GPU models without manual intervention, while staying backward compatible with the existing GPU detection logic.
  • Suggested Fix:
    • Primary detection: Continue using the existing hardcoded GPU list for known models
    • Fallback mechanism: When no matches are found in the hardcoded list:
      • Group VMBus devices by their last segment (device ID suffix)
      • Identify GPU device groups where all entries are marked as "PCI Express pass-through"
      • Validate the count matches nvidia-smi output for accuracy
    • Direct counting: Added a new function that gets the GPU count directly from the nvidia-smi command output, removing the dependency on a manually maintained hardcoded GPU model list
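
A minimal sketch of the fallback grouping described above, not the code in this PR. The `VmbusDevice` type, the `count_gpus_by_last_segment` name, and the device ID format are hypothetical; `expected_count` stands in for the count reported by nvidia-smi.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List

# Device description used to recognise passed-through PCI devices.
PASSTHROUGH_NAME = "PCI Express pass-through"


@dataclass
class VmbusDevice:
    """Hypothetical stand-in for one parsed lsvmbus entry."""

    name: str
    device_id: str


def count_gpus_by_last_segment(
    devices: List[VmbusDevice], expected_count: int
) -> int:
    """Return the size of the first last-segment group that looks like GPUs.

    A group qualifies when every entry is a PCI Express pass-through device
    and its size equals expected_count (assumed to come from nvidia-smi).
    Returns 0 when no group qualifies.
    """
    groups: DefaultDict[str, List[VmbusDevice]] = defaultdict(list)
    for device in devices:
        # Assumption: IDs look like "{xxxx-xxxx-...-suffix}".
        suffix = device.device_id.strip("{}").split("-")[-1].lower()
        groups[suffix].append(device)

    for members in groups.values():
        all_passthrough = all(PASSTHROUGH_NAME in m.name for m in members)
        if all_passthrough and len(members) == expected_count:
            return len(members)
    return 0
```

Checking the group size against the nvidia-smi count is what keeps other pass-through devices (for example NVMe disks that happen to share a suffix) from being counted as GPUs, provided their group size differs from the GPU count.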


# If no matches in hardcoded list, group by last segment
self._log.debug("No GPUs found in hardcoded list, trying last-segment grouping")
gpu_count = self._get_gpu_count_by_last_segment(vmbus_devices)
Member

@squirrelsc squirrelsc Oct 10, 2025

Why not get the count directly with this method? And why can it find more GPUs than the method above?

Collaborator Author

The known-list approach seems the more deterministic path for known SKUs. The grouping fallback only activates when the primary lookup returns zero matches, so it won't over-count or regress existing coverage. Grouping by last segment lets us automatically recognize newly released GPUs that share a common encoded suffix, without waiting for a manual list update.
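
A rough sketch of that ordering, under assumptions: `count_gpus`, `known_gpu_ids`, and the `fallback` callable are hypothetical names for illustration, not identifiers from this PR.

```python
from typing import Callable, List, Set


def count_gpus(
    device_ids: List[str],
    known_gpu_ids: Set[str],
    fallback: Callable[[List[str]], int],
) -> int:
    """Count GPUs from the known list first; use the fallback only on zero."""
    # Primary path: exact matches against the maintained list of known IDs.
    known = sum(1 for d in device_ids if d.lower() in known_gpu_ids)
    if known > 0:
        # Known SKUs never reach the fallback, so existing results are unchanged.
        return known
    # Fallback path: only reached when the known list matched nothing.
    return fallback(device_ids)
```

Because the fallback runs only when the primary count is zero, it can add coverage for unlisted models but cannot change the result for a SKU that is already in the list.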
