
torch.autocast and DeepSpeed apparently cannot be combined directly; running the example raises an error #2788

@yangyyt

Description


Describe the bug
Training examples/aishell/whisper with DeepSpeed fails with the following error:

python3.10/site-packages/deepspeed/runtime/torch_autocast.py", line 97, in validate_nested_autocast
raise AssertionError(
AssertionError: torch.autocast is enabled outside DeepSpeed, but not in the DeepSpeed config. Please enable torch.autocast through the DeepSpeed config to ensure the correct communication dtype is used.
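
The assertion message asks for autocast to be enabled through the DeepSpeed config rather than around the forward call. A minimal sketch of what that might look like follows, written as a Python dict that is dumped to a JSON config file. The `torch_autocast` section with its `enabled` and `dtype` keys is taken from my reading of the DeepSpeed configuration docs and should be checked against the installed version (0.17.5 here). The batch size value and the `write_ds_config` helper are hypothetical placeholders, not part of the recipe, and this has not been tested against examples/aishell/whisper.

```python
import json

# Hypothetical DeepSpeed config fragment. Only the "torch_autocast" section
# is the point of the example; the batch size is a placeholder value.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "torch_autocast": {
        "enabled": True,       # let DeepSpeed manage torch.autocast
        "dtype": "bfloat16",   # autocast compute dtype
    },
}


def write_ds_config(path, config=ds_config):
    """Write the config dict to `path` as JSON and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path
```

The resulting file would be passed wherever the recipe already points at its DeepSpeed JSON config. Whether the existing `with autocast` block in batch_forward can stay once this is enabled is something I have not verified.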

After modifying the batch_forward function to remove the with autocast block, training runs, but a dtype mismatch appears:
ch/nn/modules/conv.py", line 370, in _conv_forward
return F.conv1d(
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
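
This error means the convolution's weights and bias are in bfloat16 (presumably converted by DeepSpeed) while the input features are still float32. A minimal sketch of casting the input to the module's parameter dtype is shown here. The `cast_to_module_dtype` helper and the small `Conv1d` are hypothetical stand-ins for illustration and are not the actual Whisper encoder code.

```python
import torch
import torch.nn as nn


def cast_to_module_dtype(x, module):
    """Cast a floating-point tensor to the dtype of the module's first parameter."""
    param = next(module.parameters())
    # Leave integer tensors (e.g. lengths, token ids) untouched.
    return x.to(dtype=param.dtype) if x.is_floating_point() else x


# Stand-in for a bf16 encoder convolution; shapes are arbitrary.
conv = nn.Conv1d(80, 8, kernel_size=3, padding=1).to(torch.bfloat16)
feats = torch.randn(2, 80, 50)  # float32 features, as a data loader would produce

out = conv(cast_to_module_dtype(feats, conv))
```

Without the cast, `conv(feats)` fails with the same kind of input/parameter dtype mismatch as reported above.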

After manually casting the input to bf16, computing the loss raises a new error:
RuntimeError: "ctc_loss_cuda" not implemented for 'BFloat16'
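
The CUDA CTC kernel has no bfloat16 implementation, so one possible workaround is to upcast the logits to float32 just before the loss. Under torch.autocast, `ctc_loss` is among the ops run in float32 automatically, which is presumably why this error only showed up once the autocast block was removed. The sketch below is an assumption about how the upcast could be done by hand. `ctc_loss_fp32` is a hypothetical helper, not the function used in the recipe, and the tensor shapes are made up.

```python
import torch
import torch.nn.functional as F


def ctc_loss_fp32(logits, targets, input_lengths, target_lengths, blank=0):
    """Compute CTC loss in float32 from (T, N, C) logits of any float dtype."""
    log_probs = F.log_softmax(logits.float(), dim=-1)  # upcast before the loss
    return F.ctc_loss(
        log_probs, targets, input_lengths, target_lengths,
        blank=blank, reduction="sum", zero_infinity=True,
    )


# Toy example: T time steps, N utterances, C classes, S labels per utterance.
T, N, C, S = 20, 2, 10, 5
logits = torch.randn(T, N, C).to(torch.bfloat16)       # bf16 model output
targets = torch.randint(1, C, (N, S))                  # labels exclude blank (0)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc_loss_fp32(logits, targets, input_lengths, target_lengths)
```

Only the loss computation runs in float32 here; the rest of the model can stay in bf16.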

torch version: 2.6.0
deepspeed: 0.17.5
