error on using parallel gpu #404
Comments
I can confirm that this problem goes away if the number of examples in your training set is a multiple of accumulate × batch_size. E.g., if batch_size is 16 and accumulate is 4, then your training set length should be a multiple of 16*4=64.
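For anyone wanting to try this workaround, here is a minimal sketch (the train.txt path and output file name are assumptions, not part of the repo) that trims the training list so its length is a multiple of batch_size * accumulate:

```python
# Minimal sketch: trim the training list so its length is a multiple of
# batch_size * accumulate (64 in the example above). Paths are assumptions.
batch_size = 16
accumulate = 4
step = batch_size * accumulate

with open("data/train.txt") as f:
    lines = [l for l in f if l.strip()]

keep = len(lines) - (len(lines) % step)  # drop the remainder
with open("data/train_trimmed.txt", "w") as f:
    f.writelines(lines[:keep])

print(f"kept {keep} of {len(lines)} images, dropped {len(lines) % step}")
```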
I have encountered this error too. Have you found a solution for this yet?
All, it goes without saying that your batch_size should always be divisible by the number of GPUs. So, for example, batch_size 10 with 4 GPUs is not OK. Multi-GPU should work as long as you follow this rule. The default COCO train set has 117,263 images and trains (and tests) fine on a GCP instance with 1, 2, 4 or 8 GPUs under the default settings.
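As a quick illustration of this rule (not part of train.py, just a standalone check):

```python
# Illustrative check only: batch_size must divide evenly across available GPUs.
# The batch_size value here mirrors the example above.
import torch

batch_size = 10
n_gpu = torch.cuda.device_count() or 1  # e.g. 4 on a 4-GPU instance
if batch_size % n_gpu:
    raise ValueError(f"batch_size {batch_size} is not divisible by {n_gpu} GPUs")
```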
@glenn-jocher I trained with a batch_size of 14 and 2 GPUs and still got this error.
@turboxin
sudo rm -rf yolov3 # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py # verify detection
python3 train.py # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
@sanazss I am encountering the same error, even though the batch size is a multiple of the number of GPUs. May I know how you solved this problem in the end? Thanks so much!
Both batch_size and the number of train.txt rows should always be divisible by the number of GPUs.
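A small sketch of the train.txt part of this rule (the path is an assumption and may differ in your data config):

```python
# Illustrative check: the number of train.txt rows should be divisible by the
# number of GPUs. The train.txt path is an assumption.
import torch

n_gpu = torch.cuda.device_count() or 1

with open("data/train.txt") as f:
    n_images = sum(1 for line in f if line.strip())

if n_images % n_gpu:
    print(f"{n_images} images is not divisible by {n_gpu} GPUs; "
          f"drop {n_images % n_gpu} lines from train.txt")
```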
Hi. I am using your latest version and get this error after one epoch. Any hint on what the error is?
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x2b9d1c82b441 in /redresearch/ssalati/venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x2b9d1c82ad7a in /redresearch/ssalati/venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x2b9cdd08483c in /redresearch/ssalati/venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x6c52bd (0x2b9cdd07a2bd in /redresearch/ssalati/venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x130cfc (0x2b9cdcae5cfc in /redresearch/ssalati/venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #31: __libc_start_main + 0xf1 (0x2b9cb531c2e1 in /lib/x86_64-linux-gnu/libc.so.6)
It tells me to pass the flag `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. Would you kindly let me know what the problem is? Thanks.
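For reference, a minimal sketch of the change the error message suggests. The tiny linear model and single-process gloo group below are stand-ins for illustration only, not the actual train.py code; `find_unused_parameters=True` makes DDP tolerate parameters that do not contribute to the output of a given `forward` pass, which is what the message indicates is happening here.

```python
# Minimal sketch of passing find_unused_parameters=True to DDP. The model and
# single-process gloo group are stand-ins, not the repo's actual setup.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 2)  # stand-in for the YOLO model
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True,  # lets DDP handle parameters unused in forward()
)

out = ddp_model(torch.randn(4, 10))
out.sum().backward()

dist.destroy_process_group()
```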