Open
Description
This feature adds GPU-level telemetry collection inside VMs by leveraging the QEMU Guest Agent to run nvidia-smi. It enables OpenNebula to monitor critical GPU metrics such as utilization, memory usage, temperature, and power consumption.
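The issue does not spell out how the guest agent is driven, so the following is only a minimal sketch of one plausible flow, assuming the QEMU Guest Agent `guest-exec` and `guest-exec-status` commands are used to run `nvidia-smi` and fetch its output. The helper names (`build_guest_exec`, `build_guest_exec_status`, `decode_exec_status`) are hypothetical and are not taken from the OpenNebula code base.

```python
import base64
import json


def build_guest_exec(path, args):
    """Build a guest-exec request that starts a command inside the guest."""
    return json.dumps({
        "execute": "guest-exec",
        "arguments": {"path": path, "arg": list(args), "capture-output": True},
    })


def build_guest_exec_status(pid):
    """Build a guest-exec-status request for the PID returned by guest-exec."""
    return json.dumps({"execute": "guest-exec-status", "arguments": {"pid": pid}})


def decode_exec_status(reply):
    """Return the command's stdout as text, or None if it is still running."""
    result = json.loads(reply)["return"]
    if not result.get("exited"):
        return None
    exitcode = result.get("exitcode", 0)
    if exitcode != 0:
        raise RuntimeError(f"guest command failed with exit code {exitcode}")
    # The guest agent returns captured stdout base64-encoded in "out-data".
    return base64.b64decode(result.get("out-data", "")).decode()
```

On a libvirt host, payloads like these can be sent with `virsh qemu-agent-command <domain> '<payload>'`; whether the driver uses that path or talks to the agent socket directly is not stated in this issue.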
Use case
Users running AI/ML workloads in OpenNebula need visibility into GPU performance and health to ensure efficient scheduling, workload balancing, and troubleshooting.
Interface Changes
CLI: GPU metrics will be included in onevm show.
(Pending) Sunstone: GPU monitoring panel/tab in the VM view.
Metrics collection will skip VMs without assigned NVIDIA devices.
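As an illustration of the skip rule, here is a small sketch that decides whether a VM should be probed. It assumes the VM's passthrough devices are available as a list of dictionaries with a `VENDOR` key holding the PCI vendor ID in hex, as in OpenNebula `PCI` template attributes; the function name `has_nvidia_device` and the input shape are assumptions, not the actual implementation.

```python
NVIDIA_VENDOR_ID = "10de"  # PCI vendor ID assigned to NVIDIA


def _normalize_vendor(vendor):
    """Lower-case a vendor ID and drop an optional 0x prefix."""
    value = str(vendor).strip().lower()
    return value[2:] if value.startswith("0x") else value


def has_nvidia_device(pci_devices):
    """Return True if any assigned PCI device reports the NVIDIA vendor ID."""
    return any(
        _normalize_vendor(dev.get("VENDOR", "")) == NVIDIA_VENDOR_ID
        for dev in pci_devices
    )
```

A monitoring loop could call this first and only contact the guest agent when it returns `True`, which avoids running `nvidia-smi` in guests that cannot have a GPU.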
Additional Context
Default metrics gathered:
- gpu_count – Number of GPUs
- utilization.gpu – GPU core usage (%)
- utilization.memory – Memory bandwidth utilization (%)
- memory.free – Free GPU memory (MiB)
- power.draw – Power draw (Watts)
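The per-GPU metric names above match `nvidia-smi` query fields, so one way to gather them is `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. The sketch below parses that CSV output; it assumes `gpu_count` is derived from the number of rows returned, and the names `QUERY_FIELDS`, `NVIDIA_SMI_ARGS`, and `parse_gpu_metrics` are hypothetical.

```python
import csv
import io

# Per-GPU fields from the default metric list; gpu_count is derived below.
QUERY_FIELDS = ["utilization.gpu", "utilization.memory", "memory.free", "power.draw"]

# Arguments to pass to nvidia-smi inside the guest.
NVIDIA_SMI_ARGS = [
    "--query-gpu=" + ",".join(QUERY_FIELDS),
    "--format=csv,noheader,nounits",
]


def parse_gpu_metrics(output):
    """Parse nvidia-smi CSV output into a dict with one entry per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(output), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        gpu = {}
        for name, value in zip(QUERY_FIELDS, row):
            try:
                gpu[name] = float(value)
            except ValueError:
                # Non-numeric placeholders such as "[N/A]" become None.
                gpu[name] = None
        gpus.append(gpu)
    return {"gpu_count": len(gpus), "gpus": gpus}
```

For example, the two-line output `45, 12, 30000, 120.50` and `0, 0, 40960, [N/A]` yields `gpu_count` of 2, with `power.draw` set to `120.5` for the first GPU and `None` for the second.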
Progress Status
- Code committed
- Testing - QA
- Documentation (Release notes - resolved issues, compatibility, known issues)