[torqueusers] intermittent qsub failures

David Beer dbeer at adaptivecomputing.com
Thu Nov 21 08:17:22 MST 2013


I thought that we had this fixed in 4.2.6, but it looks like the fix is
currently only in Jarvik. We can get this released with 4.2.7.


On Wed, Nov 20, 2013 at 7:56 PM, Craig Artley <cartley at hotmail.com> wrote:

> This is with 4.1.6.
>
> What version / branch do you recommend?
>
>   -craig
>
> ________________________________
> > Date: Wed, 20 Nov 2013 12:10:02 -0700
> > From: dbeer at adaptivecomputing.com
> > To: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] intermittent qsub failures
> >
> > What version are you getting this error on? We had a related fix
> > recently.
> >
> >
> > On Tue, Nov 19, 2013 at 7:20 PM, Craig Artley
> > <cartley at hotmail.com> wrote:
> > I am seeing intermittent qsub failures. It seems to be related to load
> > --- several hundred jobs submitted. Every once in a while, qsub fails
> > with "Unknown Job Id Error" or "can not locate new job":
> >
> > Exit code = 153
> > Error: qsub: submit error (Unknown Job Id Error)
> >
> > Exit code = 196
> > Error: qsub: submit error (Invalid request MSG=can not locate new
> > job 630254.h2 (0 - Success))
> >
> > In the server log, I find messages like these:
> >
> > 11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error
> >
> > 11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply
> > code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> > type=DeleteJob, from joeuser at g4
> >
> >
> > 11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid
> > request (15004) in req_jobscript, can not locate new job 630254.h2 (0 -
> > Success)
> > 11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into
> > parallel, state 1 hop 1
> > 11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply
> > code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 -
> > Success)), aux=0, type=JobScript, from joeuser at g4
> >
> > So far I haven't found anything helpful. Please let me know if you have
> > any idea what's going on.
> >
> > By the way, we were having lots of problems with Torque and NFS, but
> > after configuring Torque as recommended in
> > http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html
> > those problems went away and our reliability improved dramatically. Now
> > all that remains are the two occasional problems above.
> >
> > -craig
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> >
> > --
> > David Beer | Senior Software Engineer
> > Adaptive Computing
> >
>
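
For anyone who wants to see how often these rejects occur relative to
submission load, here is a minimal Python sketch that tallies req_reject
entries by reply code. It assumes the semicolon-delimited layout of the
server log lines quoted above, with each record on a single line. The field
names (when, event, source, obj_class, obj_name), the helper names, and the
reading of the second field as hexadecimal are my own guesses for
illustration, not taken from the Torque documentation.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative guesses at the six ';'-separated columns.
    when: datetime
    event: int
    source: str
    obj_class: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one server log line; return None if it does not fit the layout."""
    # Split at most 5 times so a ';' inside the message text is preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
        event = int(parts[1], 16)  # assumption: the event field is hex
    except ValueError:
        return None
    return LogRecord(when, event, *parts[2:])


def count_rejects(lines: Iterable[str]) -> Counter:
    """Count req_reject records, keyed by the numeric reply code as a string."""
    counts: Counter = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec.obj_name != "req_reject":
            continue
        if "code=" not in rec.message:
            continue
        # Message looks like "Reject reply code=15001(Unknown Job Id ...".
        code = rec.message.split("code=", 1)[1].split("(", 1)[0]
        counts[code] += 1
    return counts


# Sample records copied from the log excerpt above, unwrapped to one line each.
SAMPLE = [
    "11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error",
    "11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply "
    "code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0, "
    "type=DeleteJob, from joeuser at g4",
    "11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply "
    "code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - "
    "Success)), aux=0, type=JobScript, from joeuser at g4",
]

if __name__ == "__main__":
    for code, n in sorted(count_rejects(SAMPLE).items()):
        print(code, n)
```

To run it against a real log, pass an open file object to count_rejects()
in place of SAMPLE; the location of the server log depends on the
installation.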



-- 
David Beer | Senior Software Engineer
Adaptive Computing