[torquedev] Jobs killed because of EOF between running pbs_mom's
Toni L. Harbaugh-Blackford [Contr]
harbaugh at ncifcrf.gov
Thu Jul 10 05:06:30 MDT 2008
On Thu, 10 Jul 2008, Chris Samuel wrote:
> Hi all,
> Occasionally we get users running jobs that do silly
> things like writing tens-of-gigabyte error files to
> the user's home directory, over NFS, from multiple nodes
> and jobs at the same time.
> At that point our NFS server slows down, and for some
> reason we then also see pbs_mom starting to hit trouble.
> Connections between the pbs_mom's start to time out, and
> we get parallel jobs getting killed off with errors like:
> PBS: job killed: node 7 (tango035) requested job terminate, 'EOF'
> (code 1099) - internal or network failure attempting to communicate with
> sister MOM's
> In reality the job should be allowed to continue running.
> During the time period of the network storm we had around
> 500-1000 jobs end, though it's hard to tell precisely how
> many of those were caught by this problem.
> Previously we were running with a patch that set the
> pbs_tcp_timeout value to 60s (back up from its new default
> of 20 seconds), but I'm thinking now of knocking it up
> again to try and cope with this.
> I'd *really* like a way to actually stop pbs_mom
> from killing jobs across nodes at all when it gets
> SISTER_EOF, and just retry.
> Thoughts ?
Yes, I think retry would be better too, with a descriptive
log entry (including the job id) being made for at least the first
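
A rough sketch of the retry-and-log idea discussed in this thread, in Python for brevity. This is not pbs_mom code: the function name, its parameters, the exception types, and the backoff schedule are all hypothetical choices made for illustration. It retries a failed sister contact a bounded number of times, logs the first failure with the job id, and leaves the decision about the job to the caller.

```python
import time


def contact_sister(send, job_id, node, max_attempts=5, base_delay=1.0,
                   log=print, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    Hypothetical helper, not part of TORQUE. Returns True on success and
    False if every attempt failed. It never kills anything itself; the
    caller decides what a False result means for the job.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return True
        except (EOFError, ConnectionError, TimeoutError) as exc:
            if attempt == 1:
                # Log only the first failure, and include the job id so the
                # entry can be matched to a job afterwards.
                log(f"job {job_id}: lost contact with sister MOM on {node} "
                    f"({exc!r}); retrying up to {max_attempts - 1} more times")
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    log(f"job {job_id}: sister MOM on {node} still unreachable "
        f"after {max_attempts} attempts")
    return False
```

With the defaults above, a sister that stays unreachable is tried five times over roughly 15 seconds of backoff before the function gives up and returns False.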
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
Toni Harbaugh-Blackford, [Contractor]
System Administrator, Advanced Biomedical Computing Center
Advanced Technology Program, Bldg. 430/122
SAIC-Frederick, Inc. / NCI at Frederick
P.O. Box B, Frederick, MD 21702
Phone: 301/846-5798 Fax: 301/846-5762
Email: harbaugh at ncifcrf.gov