[torquedev] Jobs killed because of EOF between running pbs_mom's

Chris Samuel csamuel at vpac.org
Thu Jul 10 02:00:45 MDT 2008

Hi all,

Occasionally we get users running jobs that do silly
things like writing tens-of-gigabyte error files to
the user's home directory, over NFS, from multiple nodes
and jobs at the same time.

At that point our NFS server slows down, and for some
reason we then also see pbs_mom starting to hit trouble.

Connections between the pbs_moms start to time out, and
we get parallel jobs being killed off with errors like:

PBS: job killed: node 7 (tango035) requested job terminate, 'EOF'
(code 1099) - internal or network failure attempting to communicate with
sister MOM's

In reality the job should be allowed to continue running.

During the time period of the network storm we had around
500-1000 jobs end, though it's hard to tell precisely how
many of those were caught by this problem.

Previously we were running with a patch that set the
pbs_tcp_timeout value to 60s (back up from its new default
of 20 seconds), but I'm thinking now of knocking it up
again to try and cope with this.

I'd *really* like a way to actually stop pbs_mom
from killing jobs across nodes at all when it gets
SISTER_EOF, and have it just retry instead.
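
To make the retry idea concrete, here is a rough sketch in C of the
kind of policy meant above. It is not TORQUE code: every name in it
(sister_state, on_sister_eof, on_sister_ok, retry_delay_secs, and the
retry and delay constants) is made up for illustration, and the limits
are guesses that would need tuning.

```c
/* Hypothetical sketch only; none of these names exist in TORQUE. */

enum sister_action {
    SISTER_RETRY,     /* wait a while, then try the sister MOM again */
    SISTER_KILL_JOB   /* give up and terminate the job */
};

struct sister_state {
    int failures;     /* consecutive failed attempts to reach the sister */
};

#define MAX_SISTER_RETRIES 5    /* assumed limit before giving up */
#define BASE_DELAY_SECS    10   /* assumed delay before the first retry */
#define MAX_DELAY_SECS     300  /* assumed cap on the retry delay */

/* Delay before the next attempt: doubles with each failure, capped. */
static int retry_delay_secs(int failures)
{
    int delay = BASE_DELAY_SECS;
    for (int i = 1; i < failures && delay < MAX_DELAY_SECS; i++)
        delay *= 2;
    return delay > MAX_DELAY_SECS ? MAX_DELAY_SECS : delay;
}

/* Called when talking to a sister fails with EOF or a timeout. */
static enum sister_action on_sister_eof(struct sister_state *s)
{
    s->failures++;
    if (s->failures > MAX_SISTER_RETRIES)
        return SISTER_KILL_JOB;
    return SISTER_RETRY;
}

/* Called when the sister answers again; forget earlier failures. */
static void on_sister_ok(struct sister_state *s)
{
    s->failures = 0;
}
```

With these made-up numbers a job would ride out five consecutive
failures against one sister, waiting 10, 20, 40, 80 and 160 seconds
between attempts, and only be killed on the sixth.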

Thoughts?

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
