
Error while using aggregate_moe_loss_stats #14692

@prathamk-tw

Description


When I call the aggregate_moe_loss_stats function, I get an AttributeError:

Traceback (most recent call last):
  File "/nemo_run/code/train_moe.py", line 141, in _log_moe_metrics
    moe_loss_dict = aggregate_moe_loss_stats(loss_scale=1.0)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo/nemo/lightning/megatron_parallel.py", line 1833, in aggregate_moe_loss_stats
    tracker = parallel_state.get_moe_layer_wise_logging_tracker()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'megatron.core.parallel_state' has no attribute 'get_moe_layer_wise_logging_tracker'

This occurs in both the nvcr.io/nvidia/nemo:25.07 and nvcr.io/nvidia/nemo:25.07.02 containers.
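As a stopgap until the container is fixed, the MoE logging call can be guarded with a feature check so training does not crash when the tracker API is absent from parallel_state. This is only a sketch: has_moe_tracker is a hypothetical helper (not part of NeMo or Megatron-Core), and the SimpleNamespace objects below merely simulate megatron.core.parallel_state with and without the attribute seen in the traceback.

```python
import types

def has_moe_tracker(parallel_state) -> bool:
    """Return True if this parallel_state module exposes the MoE
    layer-wise logging tracker that aggregate_moe_loss_stats needs."""
    return hasattr(parallel_state, "get_moe_layer_wise_logging_tracker")

# Simulated stand-ins for megatron.core.parallel_state (illustration only):
# one mimicking the broken container, one mimicking a version with the API.
broken_ps = types.SimpleNamespace()
fixed_ps = types.SimpleNamespace(get_moe_layer_wise_logging_tracker=lambda: {})

# In _log_moe_metrics one could then skip MoE stats instead of raising:
#     if has_moe_tracker(parallel_state):
#         moe_loss_dict = aggregate_moe_loss_stats(loss_scale=1.0)
#     else:
#         moe_loss_dict = {}
```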

Labels: bug (Something isn't working)