Hi there!
This is my first time reporting an issue to this repo, so please let me know if anything is missing! ;)
I hit the following error when running userbenchmark/distributed:
$ python run_benchmark.py distributed \
--ngpus 1 \
--nodes 1 \
--model torchbenchmark.e2e_models.hf_bert.Model \
--trainer torchbenchmark.util.distributed.trainer.Trainer \
--distributed ddp \
--job_dir $PWD/.userbenchmark/distributed/e2e_hf_bert \
--profiler False
/home/aztecher/benchmark/.userbenchmark/distributed/e2e_hf_bert/ad50a4d731e440c6ac57b8122b2143ce_init
Traceback (most recent call last):
File "/home/aztecher/benchmark/run_benchmark.py", line 48, in <module>
run()
File "/home/aztecher/benchmark/run_benchmark.py", line 41, in run
benchmark.run(bm_args)
File "/home/aztecher/benchmark/userbenchmark/distributed/run.py", line 28, in run
result = slurm_run(args, model_args)
File "/home/aztecher/benchmark/userbenchmark/distributed/run.py", line 92, in slurm_run
result = job.results()
File "/home/aztecher/bench/lib64/python3.9/site-packages/submitit/core/core.py", line 294, in results
raise job_exception # pylint: disable=raising-bad-type
submitit.core.utils.FailedJobError: Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
File "/home/aztecher/bench/lib64/python3.9/site-packages/submitit/core/submission.py", line 55, in process_job
result = delayed.result()
File "/home/aztecher/bench/lib64/python3.9/site-packages/submitit/core/utils.py", line 137, in result
self._result = self.function(*self.args, **self.kwargs)
File "/home/aztecher/benchmark/torchbenchmark/util/distributed/submit.py", line 134, in __call__
return trainer_class(
File "/home/aztecher/benchmark/torchbenchmark/util/distributed/trainer.py", line 33, in __init__
self.e2e_benchmark: E2EBenchmarkModel = model_class(
File "/home/aztecher/benchmark/torchbenchmark/util/e2emodel.py", line 9, in __call__
obj = type.__call__(cls, *args, **kwargs)
File "/home/aztecher/benchmark/torchbenchmark/e2e_models/hf_bert/__init__.py", line 114, in __init__
self.prep(hf_args)
File "/home/aztecher/benchmark/torchbenchmark/e2e_models/hf_bert/__init__.py", line 175, in prep
tokenizer, pad_to_multiple_of=(8 if accelerator.use_fp16 else None)
AttributeError: 'Accelerator' object has no attribute 'use_fp16'
When I checked the implementation of huggingface/accelerate, I found that the Accelerator class no longer has the use_fp16 attribute
(huggingface/accelerate#3098).
And since the version of the accelerate
module is currently not pinned in requirements.txt, anyone who runs this benchmark will hit the same issue.
I think accelerator.state.mixed_precision can be used as an alternative to that property, like below:
self.data_collator = DataCollatorWithPadding(
    tokenizer, pad_to_multiple_of=(8 if accelerator.state.mixed_precision == "fp16" else None)
)
In my environment, this works fine.
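For what it's worth, the check could also be wrapped in a small helper that tolerates both old accelerate releases (which exposed Accelerator.use_fp16) and current ones (which only expose state.mixed_precision). This is just a sketch; the uses_fp16 helper and the Accelerator stub below are my own illustration, not part of accelerate or torchbenchmark:

```python
# Minimal stand-ins for accelerate's Accelerator/AcceleratorState, so this
# sketch runs without the library installed. In real code you would import
# Accelerator from accelerate instead.
class _State:
    def __init__(self, mixed_precision):
        self.mixed_precision = mixed_precision

class Accelerator:
    def __init__(self, mixed_precision="no"):
        self.state = _State(mixed_precision)

def uses_fp16(accelerator):
    # Older accelerate releases exposed a use_fp16 attribute; prefer it if
    # present, otherwise fall back to state.mixed_precision (current API).
    if hasattr(accelerator, "use_fp16"):
        return bool(accelerator.use_fp16)
    return getattr(accelerator.state, "mixed_precision", "no") == "fp16"

# Same logic as the proposed fix, expressed through the helper:
pad_multiple = 8 if uses_fp16(Accelerator(mixed_precision="fp16")) else None
```

That said, since torchbenchmark already requires a recent accelerate, the simple state.mixed_precision check above is probably enough.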
What do you think of this fix? If it looks good, I'll open a PR for it.
Thanks.
How to reproduce
- OS: Rocky Linux 9.5
- Python: 3.9.21
- (NVIDIA Driver version: 560.28.03)
- (CUDA version: 12.6)
- (Slurm version: 24.05.3-1)
# Suppose this node is already configured as a slurm worker.
# Setup env
$ python -m venv bench
$ source bench/bin/activate
# Clone pytorch/benchmark and install requirements
(bench) $ git clone https://github.com/pytorch/benchmark.git
(bench) $ cd benchmark
(bench) $ pip install -r requirements.txt
(bench) $ pip install -r torchbenchmark/e2e_models/hf_bert/requirements.txt
# Tool / module versions
(bench) $ python --version
Python 3.9.21
(bench) $ pip --version
pip 21.3.1 from /home/mmichish/bench/lib64/python3.9/site-packages/pip (python 3.9)
(bench) $ pip list |grep accelerate
accelerate 1.3.0
# Run benchmark
(bench) $ python run_benchmark.py distributed \
--ngpus 1 \
--nodes 1 \
--model torchbenchmark.e2e_models.hf_bert.Model \
--trainer torchbenchmark.util.distributed.trainer.Trainer \
--distributed ddp \
--job_dir $PWD/.userbenchmark/distributed/e2e_hf_bert \
--profiler False