A problem that only occurs with multi-GPU training but not when using a single GPU #1371
AbrahamYabo asked this question in Q&A
Hi, we have run into a problem that only occurs with multi-GPU training and does not occur when training on a single GPU, while trying to train resnet50 with the A3 recipe. The program crashes during validation of the first epoch.
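For reference, the runs are launched roughly as sketched below. The train.py arguments are copied from the CalledProcessError at the end of the log; the torch.distributed.launch flags (e.g. --nproc_per_node=2, inferred from the "total 2" processes in the output) and the plain single-GPU invocation are our reconstruction of the two setups, not verbatim commands.

```bash
# Shared train.py arguments (copied from the failing command shown in the log below)
ARGS="--batch-size 256 --epochs 100 --model resnet50 --lr 0.008 \
  --aa rand-m6-mstd0.5-inc1 --bce-loss --bce-target-thresh 0.2 --cutmix 1.0 \
  --img-size 160 --opt fusedlamb --seed 21 --workers 1 --smoothing 0.0 \
  --warmup-epochs 5 --weight-decay 0.02 --crop-pct 1.0 \
  --data_dir /cache/imagenet/ --amp --native-amp"

# Multi-GPU: 2 processes, 1 GPU per process -- this run crashes in validation.
# (--nproc_per_node=2 is inferred from the log, not copied verbatim.)
python -m torch.distributed.launch --nproc_per_node=2 train.py $ARGS

# Single-GPU: same arguments, no launcher -- this run completes without error.
python train.py $ARGS
```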
We have tried several previously discussed solutions, but the problem persists. The error is as follows:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
INFO:train:Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 2.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 2.
INFO:train:Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
INFO:train:Model resnet50 created, param count:25557032
Model resnet50 created, param count:25557032
INFO:timm.data.config:Data processing configuration for current model + dataset:
Data processing configuration for current model + dataset:
INFO:timm.data.config: input_size: (3, 160, 160)
input_size: (3, 160, 160)
INFO:timm.data.config: interpolation: bicubic
interpolation: bicubic
INFO:timm.data.config: mean: (0.485, 0.456, 0.406)
mean: (0.485, 0.456, 0.406)
INFO:timm.data.config: std: (0.229, 0.224, 0.225)
std: (0.229, 0.224, 0.225)
INFO:timm.data.config: crop_pct: 1.0
crop_pct: 1.0
INFO:train:Using native Torch AMP. Training in mixed precision.
Using native Torch AMP. Training in mixed precision.
INFO:train:Using native Torch DistributedDataParallel.
Using native Torch DistributedDataParallel.
INFO:train:Scheduled epochs: 110
Scheduled epochs: 110
INFO:train:Train: 0 [ 0/25 ( 0%)] Loss: 0.6929 (0.693) Time: 4.092s, 125.12/s (4.092s, 125.12/s) LR: 1.000e-04 Data: 3.019 (3.019)
Train: 0 [ 0/25 ( 0%)] Loss: 0.6929 (0.693) Time: 4.092s, 125.12/s (4.092s, 125.12/s) LR: 1.000e-04 Data: 3.019 (3.019)
INFO:root:Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
INFO:root:Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
INFO:train:Train: 0 [ 24/25 (100%)] Loss: 0.6769 (0.685) Time: 0.144s, 3566.53/s (1.341s, 381.89/s) LR: 1.000e-04 Data: 0.000 (0.920)
Train: 0 [ 24/25 (100%)] Loss: 0.6769 (0.685) Time: 0.144s, 3566.53/s (1.341s, 381.89/s) LR: 1.000e-04 Data: 0.000 (0.920)
INFO:train:Distributing BatchNorm running means and vars
Distributing BatchNorm running means and vars
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:763 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f782c1722f2 in /home/ma-user/anaconda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f782c16f67b in /home/ma-user/anaconda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xc92 (0x7f782c3ca682 in /home/ma-user/anaconda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f782c15a3a4 in /home/ma-user/anaconda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e9f8a (0x7f7878f4af8a in /home/ma-user/anaconda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6ea031 (0x7f7878f4b031 in /home/ma-user/anaconda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x192a10 (0x55e344ae7a10 in /home/ma-user/anaconda/bin/python)
frame #7: <unknown function> + 0x13ad07 (0x55e344a8fd07 in /home/ma-user/anaconda/bin/python)
frame #8: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #9: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #10: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #11: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #12: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #13: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #14: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #15: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #16: <unknown function> + 0x13ae98 (0x55e344a8fe98 in /home/ma-user/anaconda/bin/python)
frame #17: <unknown function> + 0x13b5c8 (0x55e344a905c8 in /home/ma-user/anaconda/bin/python)
frame #18: <unknown function> + 0x13324e (0x55e344a8824e in /home/ma-user/anaconda/bin/python)
frame #19: <unknown function> + 0x163141 (0x55e344ab8141 in /home/ma-user/anaconda/bin/python)
frame #20: <unknown function> + 0xb07fd (0x55e344a057fd in /home/ma-user/anaconda/bin/python)
frame #21: PyTuple_New + 0xf1 (0x55e344ac03e1 in /home/ma-user/anaconda/bin/python)
frame #22: <unknown function> + 0x186e86 (0x55e344adbe86 in /home/ma-user/anaconda/bin/python)
frame #23: <unknown function> + 0x187382 (0x55e344adc382 in /home/ma-user/anaconda/bin/python)
frame #24: <unknown function> + 0x186f81 (0x55e344adbf81 in /home/ma-user/anaconda/bin/python)
frame #25: <unknown function> + 0x187356 (0x55e344adc356 in /home/ma-user/anaconda/bin/python)
frame #26: <unknown function> + 0x1a23c3 (0x55e344af73c3 in /home/ma-user/anaconda/bin/python)
frame #27: _PyMethodDef_RawFastCallKeywords + 0x8c (0x55e344acfc9c in /home/ma-user/anaconda/bin/python)
frame #28: <unknown function> + 0x191520 (0x55e344ae6520 in /home/ma-user/anaconda/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x3ec5 (0x55e344afd6d5 in /home/ma-user/anaconda/bin/python)
frame #30: _PyEval_EvalCodeWithName + 0x1cd (0x55e344a9b6ad in /home/ma-user/anaconda/bin/python)
frame #31: _PyFunction_FastCallKeywords + 0x491 (0x55e344ac9af1 in /home/ma-user/anaconda/bin/python)
frame #32: <unknown function> + 0x1913a5 (0x55e344ae63a5 in /home/ma-user/anaconda/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x1956 (0x55e344afb166 in /home/ma-user/anaconda/bin/python)
frame #34: _PyFunction_FastCallKeywords + 0xf8 (0x55e344ac9758 in /home/ma-user/anaconda/bin/python)
frame #35: <unknown function> + 0x1913a5 (0x55e344ae63a5 in /home/ma-user/anaconda/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x90a (0x55e344afa11a in /home/ma-user/anaconda/bin/python)
frame #37: _PyFunction_FastCallKeywords + 0xf8 (0x55e344ac9758 in /home/ma-user/anaconda/bin/python)
frame #38: <unknown function> + 0x1913a5 (0x55e344ae63a5 in /home/ma-user/anaconda/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x90a (0x55e344afa11a in /home/ma-user/anaconda/bin/python)
frame #40: _PyFunction_FastCallKeywords + 0xf8 (0x55e344ac9758 in /home/ma-user/anaconda/bin/python)
frame #41: <unknown function> + 0x1913a5 (0x55e344ae63a5 in /home/ma-user/anaconda/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x7c9 (0x55e344af9fd9 in /home/ma-user/anaconda/bin/python)
frame #43: _PyFunction_FastCallKeywords + 0xf8 (0x55e344ac9758 in /home/ma-user/anaconda/bin/python)
frame #44: <unknown function> + 0x1913a5 (0x55e344ae63a5 in /home/ma-user/anaconda/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x7c9 (0x55e344af9fd9 in /home/ma-user/anaconda/bin/python)
frame #46: _PyFunction_FastCallDict + 0x117 (0x55e344a9c447 in /home/ma-user/anaconda/bin/python)
frame #47: <unknown function> + 0x14afbd (0x55e344a9ffbd in /home/ma-user/anaconda/bin/python)
frame #48: _PyObject_CallMethodIdObjArgs + 0xc1 (0x55e344ab7541 in /home/ma-user/anaconda/bin/python)
frame #49: PyImport_ImportModuleLevelObject + 0x337 (0x55e344a95c67 in /home/ma-user/anaconda/bin/python)
frame #50: <unknown function> + 0x1a47e8 (0x55e344af97e8 in /home/ma-user/anaconda/bin/python)
frame #51: PyCFunction_Call + 0x5c (0x55e344a9ccfc in /home/ma-user/anaconda/bin/python)
frame #52: _PyEval_EvalFrameDefault + 0x1ca2 (0x55e344afb4b2 in /home/ma-user/anaconda/bin/python)
frame #53: _PyEval_EvalCodeWithName + 0x1cd (0x55e344a9b6ad in /home/ma-user/anaconda/bin/python)
frame #54: _PyFunction_FastCallKeywords + 0x3e0 (0x55e344ac9a40 in /home/ma-user/anaconda/bin/python)
frame #55: <unknown function> + 0x1913a5 (0x55e344ae63a5 in /home/ma-user/anaconda/bin/python)
frame #56: _PyEval_EvalFrameDefault + 0x7c9 (0x55e344af9fd9 in /home/ma-user/anaconda/bin/python)
frame #57: _PyEval_EvalCodeWithName + 0x1cd (0x55e344a9b6ad in /home/ma-user/anaconda/bin/python)
frame #58: _PyFunction_FastCallDict + 0x1ce (0x55e344a9c4fe in /home/ma-user/anaconda/bin/python)
frame #59: <unknown function> + 0x14afbd (0x55e344a9ffbd in /home/ma-user/anaconda/bin/python)
frame #60: _PyObject_CallMethodIdObjArgs + 0xc1 (0x55e344ab7541 in /home/ma-user/anaconda/bin/python)
frame #61: PyImport_ImportModuleLevelObject + 0x162 (0x55e344a95a92 in /home/ma-user/anaconda/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0x1f92 (0x55e344afb7a2 in /home/ma-user/anaconda/bin/python)
frame #63: _PyFunction_FastCallKeywords + 0xf8 (0x55e344ac9758 in /home/ma-user/anaconda/bin/python)
Traceback (most recent call last):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/ma-user/anaconda/lib/python3.7/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/home/ma-user/anaconda/lib/python3.7/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/ma-user/anaconda/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/home/ma-user/anaconda/lib/python3.7/multiprocessing/connection.py", line 921, in wait
ready = selector.select(timeout)
File "/home/ma-user/anaconda/lib/python3.7/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3068) is killed by signal: Aborted.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 859, in
main()
File "train.py", line 656, in main
eval_metrics = validate(model, loader_eval, validate_loss_fn, args, amp_autocast=amp_autocast)
File "train.py", line 805, in validate
for batch_idx, (input, target) in enumerate(loader):
File "/home/ma-user/modelarts/user-job-dir/29timm-resnet-strike/timm/data/loader.py", line 103, in iter
for next_input, next_target in self.loader:
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in next
data = self._next_data()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1182, in _next_data
idx, data = self._get_data()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1148, in _get_data
success, data = self._try_get_data()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 999, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3068) exited unexpectedly
Killing subprocess 2811
Killing subprocess 2812
Traceback (most recent call last):
File "/home/ma-user/anaconda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ma-user/anaconda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ma-user/anaconda/bin/python', '-u', 'train.py', '--local_rank=1', '--batch-size', '256', '--epochs', '100', '--model', 'resnet50', '--lr', '0.008', '--aa', 'rand-m6-mstd0.5-inc1', '--bce-loss', '--bce-target-thresh', '0.2', '--cutmix', '1.0', '--img-size', '160', '--opt', 'fusedlamb', '--seed', '21', '--workers', '1', '--smoothing', '0.0', '--warmup-epochs', '5', '--weight-decay', '0.02', '--crop-pct', '1.0', '--data_dir', '/cache/imagenet/', '--amp', '--native-amp']' returned non-zero exit status 1.
Thanks for your help in advance!
Cheers, Yabo