[torquedev] Should a communication error between pbs_mom's kill a job ?
csamuel at vpac.org
Mon May 4 06:35:37 MDT 2009
We're starting to get complaints from a cluster we're
helping out on, where spurious communication faults between
pbs_moms are causing jobs to be killed even though the jobs
themselves are still running properly. For example:
=>> PBS: job killed: node 11 (shrek026) requested job terminate, 'EOF' (code 1099) - internal or network failure attempting to communicate with sister MOM's
This usually seems to happen when nodes are under high
I/O load (mostly NFS related) and doesn't appear to be
related to any actual issues on the nodes.
To me the simplest solution for this case appears to be
to return 0 from the job_over_limit() function when it
hits that error condition, rather than letting it fall
through to the return(1).
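
To make the idea concrete, here is a minimal sketch of the kind of
change I mean. It is not the actual pbs_mom source: the struct, the
function name job_over_limit_sketch(), the tolerate_comm_errors flag
and the COMM_EOF_CODE constant are all made up for illustration, and
only the value 1099 is taken from the log message above.

```c
/* Hypothetical constant: 1099 is the code shown in the log message. */
#define COMM_EOF_CODE 1099

/* Hypothetical, heavily simplified stand-in for a sister node entry. */
struct sister_node {
    const char *host;  /* node name */
    int error;         /* 0 = healthy, otherwise an error code */
};

/*
 * Sketch of the proposed behaviour. Returns 1 if the job should be
 * killed and 0 if it should be left alone.
 *
 * With tolerate_comm_errors == 0 any node error kills the job (the
 * behaviour we are seeing). With tolerate_comm_errors != 0 a
 * communication EOF returns 0 straight away instead of falling
 * through to the return (1).
 */
int job_over_limit_sketch(const struct sister_node *nodes, int count,
                          int tolerate_comm_errors)
{
    int i;

    for (i = 0; i < count; i++) {
        if (nodes[i].error == 0)
            continue;               /* this node is fine */

        if (tolerate_comm_errors && nodes[i].error == COMM_EOF_CODE)
            return (0);             /* proposed: do not kill the job */

        return (1);                 /* any other error: kill the job */
    }

    return (0);                     /* no node reported an error */
}
```

With one healthy node and one node reporting 1099, the sketch returns
1 when the flag is 0 and 0 when the flag is set. Note that in this
sketch the early return also skips the nodes that have not been
looked at yet; whether that matters in the real function would need
checking against the source.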
Thoughts about the general principle?
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency