Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh: account for timeouts during job status checks #2767

Merged
merged 5 commits into from
Nov 7, 2018

Conversation

mgxd
Copy link
Member

@mgxd mgxd commented Oct 30, 2018

Summary

Fixes #2766 .

List of changes proposed in this PR (pull-request)

  • Avoid raising exception when checking job status if select timeout messages appear.

Acknowledgment

  • (Mandatory) I acknowledge that this contribution will be available under the Apache 2 license.

cc @agt24 - could you test out this branch and see if you still run into problems?

@@ -70,6 +70,10 @@ def _is_pending(self, taskid):
terminal_output='allatonce').run()
return res.runtime.stdout.find(str(taskid)) > -1
except RuntimeError as e:
if any(ss in str(e) for ss
in ['Socket timed out', 'not available at the moment']):
# do not raise error and allow recheck
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be a good idea to add a log entry for this case.

@effigies effigies added this to the 1.1.5 milestone Oct 31, 2018
@codecov-io
Copy link

Codecov Report

Merging #2767 into master will decrease coverage by 4.7%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2767      +/-   ##
=========================================
- Coverage    67.5%   62.8%   -4.71%     
=========================================
  Files         340     337       -3     
  Lines       43238   43105     -133     
  Branches     5362    5343      -19     
=========================================
- Hits        29187   27071    -2116     
- Misses      13351   14985    +1634     
- Partials      700    1049     +349
Flag Coverage Δ
#smoketests ?
#unittests 62.8% <ø> (-2.09%) ⬇️
Impacted Files Coverage Δ
nipype/pipeline/plugins/semaphore_singleton.py 0% <0%> (-100%) ⬇️
nipype/pipeline/plugins/linear.py 22.85% <0%> (-65.72%) ⬇️
nipype/pipeline/plugins/pbsgraph.py 29.41% <0%> (-64.71%) ⬇️
nipype/pipeline/plugins/multiproc.py 20.25% <0%> (-64.56%) ⬇️
nipype/pipeline/plugins/tools.py 21.91% <0%> (-63.02%) ⬇️
nipype/interfaces/nilearn.py 40% <0%> (-56.67%) ⬇️
nipype/utils/spm_docs.py 25.92% <0%> (-44.45%) ⬇️
nipype/pipeline/plugins/debug.py 43.75% <0%> (-43.75%) ⬇️
nipype/pipeline/plugins/legacymultiproc.py 19.41% <0%> (-42.36%) ⬇️
nipype/pipeline/plugins/oar.py 20.45% <0%> (-34.1%) ⬇️
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50862c1...912dde8. Read the comment docs.

@effigies
Copy link
Member

effigies commented Nov 1, 2018

Merge master to fix tests?

@agt24
Copy link

agt24 commented Nov 1, 2018

Thanks for addressing this so quickly. @dalejn is going to test and report back.

@effigies
Copy link
Member

effigies commented Nov 7, 2018

@dalejn Any updates?

@dalejn
Copy link
Contributor

dalejn commented Nov 7, 2018

Yes, thank you for getting to this so promptly!
It seems to be working as intended--rather than crashing, I now receive the logger warning:
181107-12:32:07,621 nipype.workflow WARNING:
SLURM timeout encountered while checking job status, treating job 12849880 as pending

@mgxd
Copy link
Member Author

mgxd commented Nov 7, 2018

great - let us know if anything come up

@mgxd mgxd merged commit cfb5aba into nipy:master Nov 7, 2018
@effigies effigies mentioned this pull request Nov 7, 2018
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Req to deal with SLURM socket errors more patiently
5 participants