
[Userbenchmark] Failed to execute userbenchmark/distributed because 'Accelerator' object has no attribute 'use_fp16' #2593


Description

@aztecher

Hi there!
This is my first time reporting an issue to this repo, so any advice is welcome! ;)

I hit the following error when executing userbenchmark/distributed.

$ python run_benchmark.py distributed \
  --ngpus 1 \
  --nodes 1  \
  --model torchbenchmark.e2e_models.hf_bert.Model \
  --trainer torchbenchmark.util.distributed.trainer.Trainer \
  --distributed ddp \
  --job_dir $PWD/.userbenchmark/distributed/e2e_hf_bert \
  --profiler False
/home/aztecher/benchmark/.userbenchmark/distributed/e2e_hf_bert/ad50a4d731e440c6ac57b8122b2143ce_init
Traceback (most recent call last):
  File "/home/aztecher/benchmark/run_benchmark.py", line 48, in <module>
    run()
  File "/home/aztecher/benchmark/run_benchmark.py", line 41, in run
    benchmark.run(bm_args)
  File "/home/aztecher/benchmark/userbenchmark/distributed/run.py", line 28, in run
    result = slurm_run(args, model_args)
  File "/home/aztecher/benchmark/userbenchmark/distributed/run.py", line 92, in slurm_run
    result = job.results()
  File "/home/aztecher/bench/lib64/python3.9/site-packages/submitit/core/core.py", line 294, in results
    raise job_exception  # pylint: disable=raising-bad-type
submitit.core.utils.FailedJobError: Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
  File "/home/aztecher/bench/lib64/python3.9/site-packages/submitit/core/submission.py", line 55, in process_job
    result = delayed.result()
  File "/home/aztecher/bench/lib64/python3.9/site-packages/submitit/core/utils.py", line 137, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/home/aztecher/benchmark/torchbenchmark/util/distributed/submit.py", line 134, in __call__
    return trainer_class(
  File "/home/aztecher/benchmark/torchbenchmark/util/distributed/trainer.py", line 33, in __init__
    self.e2e_benchmark: E2EBenchmarkModel = model_class(
  File "/home/aztecher/benchmark/torchbenchmark/util/e2emodel.py", line 9, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/home/aztecher/benchmark/torchbenchmark/e2e_models/hf_bert/__init__.py", line 114, in __init__
    self.prep(hf_args)
  File "/home/aztecher/benchmark/torchbenchmark/e2e_models/hf_bert/__init__.py", line 175, in prep
    tokenizer, pad_to_multiple_of=(8 if accelerator.use_fp16 else None)
AttributeError: 'Accelerator' object has no attribute 'use_fp16'

When I checked the implementation of huggingface/accelerate, I found that the Accelerator class no longer has the use_fp16 attribute (huggingface/accelerate#3098).
And since the accelerate version is currently not pinned in requirements.txt, anyone who runs this benchmark will hit the same issue.
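For reference, the failure reproduces outside the benchmark in a few lines (a minimal sketch; assumes accelerate 1.3.0 as in my environment, where the deprecated use_fp16 property was removed per huggingface/accelerate#3098):

# Minimal repro sketch (accelerate 1.3.0).
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.state.mixed_precision)  # a string: "no", "fp16", "bf16", ...
print(accelerator.use_fp16)  # AttributeError: 'Accelerator' object has no attribute 'use_fp16'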

I think accelerator.state.mixed_precision can serve as a replacement for that property, like below.

            self.data_collator = DataCollatorWithPadding(
                tokenizer, pad_to_multiple_of=(8 if accelerator.state.mixed_precision == "fp16" else None)
            )

In my environment, this works fine.
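As a quick sanity check of the replacement (just a sketch, not part of the proposed patch; note that accelerate rejects fp16 mixed precision on CPU-only machines, so this needs a GPU):

# Sanity-check sketch: state.mixed_precision is a string ("no", "fp16",
# "bf16", ...), so comparing against "fp16" keeps the old use_fp16
# semantics: only fp16 runs get pad_to_multiple_of=8.
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")  # requires a GPU
assert accelerator.state.mixed_precision == "fp16"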

What do you think of this fix? If it looks OK, I will create a PR to resolve this issue.
Thanks.

How to reproduce

  • OS: Rocky Linux 9.5
  • Python: 3.9.21
  • (NVIDIA Driver version: 560.28.03)
  • (CUDA version: 12.6)
  • (Slurm version: 24.05.3-1)
# Suppose this node is already configured as a slurm worker.

# Setup env
$ python -m venv bench
$ source bench/bin/activate

# Clone pytorch/benchmark and install requirements
(bench) $ git clone https://github.com/pytorch/benchmark.git
(bench) $ cd benchmark
(bench) $ pip install -r requirements.txt
(bench) $ pip install -r torchbenchmark/e2e_models/hf_bert/requirements.txt


# Tool / module versions
(bench) $ python --version
Python 3.9.21
(bench) $ pip --version
pip 21.3.1 from /home/mmichish/bench/lib64/python3.9/site-packages/pip (python 3.9)
(bench) $ pip list |grep accelerate
accelerate               1.3.0

# Run benchmark
(bench) $ python run_benchmark.py distributed \
  --ngpus 1 \
  --nodes 1  \
  --model torchbenchmark.e2e_models.hf_bert.Model \
  --trainer torchbenchmark.util.distributed.trainer.Trainer \
  --distributed ddp \
  --job_dir $PWD/.userbenchmark/distributed/e2e_hf_bert \
  --profiler False
