[BUG]: nvlink partition doesn't exist in NMX-M after successful creation by monitor #572

@tmcroberts97

Version

0.3.0

Which installation method(s) does this occur on?

No response

Describe the bug.

After the nvlink partition monitor successfully creates a partition via NMX-M (i.e. the operation reports "completed" when it is polled), the monitor pulls a fresh partition list from NMX-M and creates a DB record for the partition. In some cases, even though NMX-M reported a successful partition creation, the fresh list does not contain the new partition. As a result, the monitor does not store the partition in the DB.
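To make the lookup step concrete, here is a minimal sketch of re-fetching the partition list a bounded number of times before giving up. It is not the monitor's actual code: `NmxPartition`, `Lookup`, `find_by_name`, and the `fetch_list` closure are hypothetical names introduced for illustration, and the lookup by name is assumed from the "not found for name" error in the logs. Since the delay observed in this issue was over an hour, a short bounded retry like this would probably not be enough on its own; the point is mainly that "not listed yet" is a distinct outcome from "never created".

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical, simplified entry of the NMX-M partition list.
#[derive(Debug, Clone, PartialEq)]
struct NmxPartition {
    id: String,
    name: String,
}

/// Result of looking for a partition whose creation was reported as completed.
#[derive(Debug, PartialEq)]
enum Lookup {
    /// The partition is listed, so a DB record can be created.
    Found(NmxPartition),
    /// Still not listed after all attempts. The caller should not treat
    /// this as "never created".
    NotYetVisible,
}

/// Re-fetches the list up to `max_attempts` times, sleeping `delay` between tries.
fn find_by_name<F>(mut fetch_list: F, name: &str, max_attempts: u32, delay: Duration) -> Lookup
where
    F: FnMut() -> Vec<NmxPartition>,
{
    for attempt in 1..=max_attempts {
        if let Some(found) = fetch_list().into_iter().find(|p| p.name == name) {
            return Lookup::Found(found);
        }
        // Only sleep if another attempt will follow.
        if attempt < max_attempts {
            thread::sleep(delay);
        }
    }
    Lookup::NotYetVisible
}
```

In a real monitor the `fetch_list` closure would wrap the NMX-M list call, and a `NotYetVisible` result would have to be remembered across iterations rather than logged as an error and dropped.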

It can take some time for the partition to appear in the list from NMX-M (in the case shown in the logs below it took over an hour). During this time, subsequent iterations of the monitor will attempt to put the GPUs in a partition, but NMX-M will complain that the GPUs are already in one. Eventually, the partition appears in NMX-M, and the monitor will complain that it doesn't know about the partition*.

*This last part is addressed in #471. Without that fix, the partition must be manually removed from NMX-M, which will allow the monitor to cleanly recreate the partition.
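The sequence described above suggests that the monitor needs three outcomes per logical partition rather than two. The sketch below shows one possible way to express that decision, assuming the partition ID returned in the completed operation's result is kept between iterations. `NextAction` and `next_action` are hypothetical names, not part of the existing code, and this is only one of several ways the issue could be approached.

```rust
use std::collections::HashSet;

/// What the monitor could do for one logical partition on this iteration.
#[derive(Debug, PartialEq)]
enum NextAction {
    /// The partition is visible in the NMX-M list: safe to write the DB record.
    RecordInDb,
    /// Creation completed, but the partition is not listed yet: wait and
    /// do not issue another create request for the same GPUs.
    WaitForVisibility,
    /// No completed creation is known: issue a create request.
    CreatePartition,
}

/// `created_id` is the partition ID from a completed create operation, if any.
/// `listed_ids` is the set of partition IDs in the fresh NMX-M list.
fn next_action(created_id: Option<&str>, listed_ids: &HashSet<String>) -> NextAction {
    match created_id {
        Some(id) if listed_ids.contains(id) => NextAction::RecordInDb,
        Some(_) => NextAction::WaitForVisibility,
        None => NextAction::CreatePartition,
    }
}
```

With the `WaitForVisibility` state, later iterations would skip the recreate attempt that currently gets rejected, and the DB record would be written once the partition finally shows up in the list.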

Minimum reproducible example

Relevant log output

Successful creation:
2026-03-13 11:48:55.869 info level=INFO span_id=0x269772b79441614e msg="Operation Operation { id: \"<op-id>\", created_at: \"2026-03-13T11:48:54.136Z\", updated_at: \"2026-03-13T11:48:55.738Z\", status: Completed, percentage: 100.0, current_step: \"Done\", request: OperationRequest { method: Post, uri: \"/nmx/v1/partitions\", body: Some(Some(Object {\"members\": Array [String(\"<gpu-id0>\"), String(\"<gpu-id1>\"), String(\"<gpu-id2>\"), String(\"<gpu-id3>\")], \"name\": String(\"<partition-name>\")})), cancellable: true }, result: Some(OperationResult { data: Some(Some(String(\"<partition-id>\"))), error: Some(\"\"), details: Some(\"The requested Partition ID: <partition-id>\") }) } for logical partition <logical-partition-id> completed successfully" location="crates/api/src/nvl_partition_monitor/mod.rs:1686"

Error when adding to DB:
2026-03-13 11:48:57.895 error level=ERROR span_id=0x269772b79441614e msg="NMX-M partition not found for name <partition-name>" location="crates/api/src/nvl_partition_monitor/mod.rs:1850"

Error message when the monitor tries to recreate the partition, because NMX-C says the GPUs are already in a partition (indicated by error code 25):
result: Some(OperationResult { data: Some(None), error: Some(\"network-rejected\"), details: Some(\"Controller responded with: 25\") }
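For illustration, the rejection above could be told apart from other failures by parsing the controller code out of the `details` string. This is a sketch under assumptions: the `"Controller responded with: "` prefix and the `"network-rejected"` error string are taken from the log line above, the meaning of code 25 is as stated in this report, and `CreateFailure`, `controller_code`, and `classify_failure` are hypothetical names. Whether NMX-M guarantees this message format is not known.

```rust
/// Controller code that, per this report, means the GPUs are already in a partition.
const CODE_ALREADY_IN_PARTITION: u32 = 25;

/// Hypothetical classification of a failed partition create operation.
#[derive(Debug, PartialEq)]
enum CreateFailure {
    /// The GPUs already belong to a partition (possibly one created earlier
    /// that is not yet visible in the NMX-M list).
    GpusAlreadyPartitioned,
    /// Any other failure, kept as-is for logging.
    Other { error: String, controller_code: Option<u32> },
}

/// Extracts the numeric code from a string like "Controller responded with: 25".
fn controller_code(details: &str) -> Option<u32> {
    details
        .strip_prefix("Controller responded with: ")?
        .trim()
        .parse()
        .ok()
}

/// Classifies a failure from the `error` and `details` fields of the operation result.
fn classify_failure(error: &str, details: &str) -> CreateFailure {
    let code = controller_code(details);
    if error == "network-rejected" && code == Some(CODE_ALREADY_IN_PARTITION) {
        CreateFailure::GpusAlreadyPartitioned
    } else {
        CreateFailure::Other {
            error: error.to_string(),
            controller_code: code,
        }
    }
}
```

A monitor could use `GpusAlreadyPartitioned` as a hint that an earlier creation may have succeeded without being listed yet, instead of reporting a plain failure.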

Other/Misc.

No response

Code of Conduct

  • I agree to follow NVIDIA Bare Metal Manager's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Labels

bug: A defect in existing software (deprecated - use issue type, but it's needed for reporting now)

Projects

Status

In progress
