[torquedev] Jobs killed because of EOF between running pbs_mom's

Nate Woody Nate.A.Woody at runbox.com
Thu Jul 10 05:51:01 MDT 2008


A retry for this would be great.  We have seen a similar effect when people submit job-array-like workloads with stage-ins.  We'd get 1000 batch jobs submitted that all needed to stage data chunks, the network would rapidly get swamped, and we'd get that same error from the moms (it also broke job submission for a while).  We ended up solving it by putting a wrapper around the command Torque uses for stage-in, but a more general solution would be welcome.
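
For illustration only, a stage-in wrapper of this general kind might look like the sketch below. It is not the wrapper described above, whose details are not given here. It assumes the real copy command and its arguments are passed on the command line, and it retries a failed copy with jittered exponential backoff so that many jobs staging at once do not all retry together. The names backoff_delays and run_with_retry are hypothetical.

```python
import random
import subprocess
import sys
import time


def backoff_delays(attempts, base=1.0, cap=60.0):
    """Return the upper bound (seconds) of the wait before each retry."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def run_with_retry(cmd, attempts=5, base=1.0, cap=60.0,
                   sleep=time.sleep, run=subprocess.call):
    """Run cmd; on a non-zero exit status, retry up to `attempts` times."""
    status = run(cmd)
    for bound in backoff_delays(attempts, base, cap):
        if status == 0:
            break
        # Random wait in [0, bound] spreads out retries from many jobs.
        sleep(random.uniform(0, bound))
        status = run(cmd)
    return status


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: wrapper.py <real-copy-command> <args...>
    sys.exit(run_with_retry(sys.argv[1:]))
```

Backoff only spaces the retries out; it does not cap how many copies run at the same moment, which would need some shared limit across nodes.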

Best,
Nate


----- Start Original Message -----
Sent: Thu, 10 Jul 2008 07:06:30 -0400 (EDT)
From: "Toni L. Harbaugh-Blackford [Contr]" <harbaugh at ncifcrf.gov>
To: Chris Samuel <csamuel at vpac.org>
Subject: Re: [torquedev] Jobs killed because of EOF between running pbs_mom's

> Chris-
> 
> On Thu, 10 Jul 2008, Chris Samuel wrote:
> 
>   > Hi all,
>   > 
>   > Occasionally we get users running jobs that do silly
>   > things like writing tens-of-gigabyte error files to
>   > the users home directory, over NFS, from multiple nodes
>   > and jobs at the same time.
>   > 
>   > At that point our NFS server slows down, and for some
>   > reason we then also see pbs_mom starting to hit trouble.
>   > 
>   > Connections between the pbs_mom's start to time out, and
>   > we get parallel jobs getting killed off with errors like:
>   > 
>   > PBS: job killed: node 7 (tango035) requested job terminate, 'EOF'
>   > (code 1099) - internal or network failure attempting to communicate with
>   > sister MOM's
>   > 
>   > In reality the job should be allowed to continue running.
>   > 
>   > During the time period of the network storm we had around
>   > 500-1000 jobs end, though it's hard to tell precisely how
>   > many of those were caught by this problem.
>   > 
>   > Previously we were running with a patch that set the
>   > pbs_tcp_timeout value to 60s (back up from its new default
>   > of 20 seconds), but I'm thinking now of knocking it up
>   > again to try and cope with this.
>   > 
>   > I'd *really* like a way to actually stop pbs_mom
>   > from killing jobs across nodes at all when it gets
>   > SISTER_EOF, and just retry..
>   > 
>   > Thoughts ?
> 
> Yes, I think retry would be better too, with a descriptive
> log entry (including the job id) being made for at least the first
> retry.
> 
> Thanks,
> Toni
> 
>   > 
>   > Chris
>   > -- 
>   > Christopher Samuel - (03) 9925 4751 - Systems Manager
>   >  The Victorian Partnership for Advanced Computing
>   >  P.O. Box 201, Carlton South, VIC 3053, Australia
>   > VPAC is a not-for-profit Registered Research Agency
>   > _______________________________________________
>   > torquedev mailing list
>   > torquedev at supercluster.org
>   > http://www.supercluster.org/mailman/listinfo/torquedev
>   > 
> 
> -----------------------------------------------------------------------------
> Toni Harbaugh-Blackford, [Contractor]
> System Administrator, Advanced Biomedical Computing Center
> Advanced Technology Program, Bldg. 430/122
> SAIC-Frederick, Inc. / NCI at Frederick
> P.O. Box B, Frederick, MD 21702
> Phone: 301/846-5798    Fax:  301/846-5762
> Email: harbaugh at ncifcrf.gov
> 
> 
> 
> 

----- End Original Message -----
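
The behaviour asked for in the quoted messages (retry on EOF rather than kill the job, and log the job id on at least the first retry) could be sketched roughly as follows. This is not pbs_mom code, which is written in C. The function contact_sister, its parameters, and the use of EOFError to stand in for the SISTER_EOF condition are all assumptions made for illustration.

```python
import logging
import time

log = logging.getLogger("mom_sketch")


def contact_sister(send, job_id, node, retries=3, delay=5.0,
                   sleep=time.sleep):
    """Call send(); on EOFError, retry instead of failing immediately.

    Returns True if a call succeeded, False once retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            send()
            return True
        except EOFError:
            if attempt == retries:
                break
            if attempt == 0:
                # Log the first retry with the job id, as suggested.
                log.warning("job %s: EOF from sister %s, retrying",
                            job_id, node)
            sleep(delay)
    log.error("job %s: sister %s still unreachable after %d retries",
              job_id, node, retries)
    return False
```

Only when this returns False would the caller fall back to whatever it does today on a communication failure.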

