[torqueusers] intermittent qsub failures with 4.2.7
cartley at hotmail.com
Fri Mar 21 16:26:50 MDT 2014
Matt, thanks for the idea. What value are you using? It looks like I have clientretry set to 5 and qsubsleep set to 0.
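In the meantime, one stopgap for large batches would be a client-side wrapper that simply reruns qsub when it exits non-zero. Below is a minimal sketch of that idea, not anything TORQUE ships. The function name submit_with_retry and its retries/delay parameters are hypothetical, and the defaults are arbitrary. It also assumes that a failed qsub did not actually queue the job, which I have not verified for the "cannot locate new job" error, so a retry could in principle create a duplicate.

```python
import subprocess
import time


def submit_with_retry(script, retries=5, delay=1.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run `qsub script`, retrying whenever qsub exits non-zero.

    Hypothetical helper: returns the job id that qsub prints on
    success, or raises RuntimeError after the last failed attempt.
    `run` and `sleep` are injectable so the logic can be tested
    without a real qsub on the PATH.
    """
    last_error = ""
    for attempt in range(1, retries + 1):
        result = run(["qsub", script], capture_output=True, text=True)
        if result.returncode == 0:
            # qsub prints the new job id on stdout, e.g. "1320429.h2"
            return result.stdout.strip()
        last_error = result.stderr.strip()
        if attempt < retries:
            # Assumption: a short pause gives the server time to recover.
            sleep(delay)
    raise RuntimeError(
        f"qsub failed after {retries} attempts: {last_error}"
    )
```

Calling submit_with_retry("job.sh") from the script that generates the batch would hide the occasional failure, at the cost of masking whatever the underlying server-side problem is.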
> From: msbritt at umich.edu
> To: torqueusers at supercluster.org
> Date: Fri, 21 Mar 2014 18:17:56 -0400
> Subject: Re: [torqueusers] intermittent qsub failures with 4.2.7
> Hi Craig - I don't believe that was the error syntax I remember, but we
> did have an issue with rapid qsub failures. Since we started using
> 'clientretry', we have not had a problem with this (failures are just
> retried, so no attempted submission fails):
> In case that helps....
> - Matt
> Matthew Britt
> CAEN HPC Group - College of Engineering
> msbritt at umich.edu
> On 21 Mar 2014, at 17:33, Craig Artley wrote:
> > Hello, last November I inquired here about intermittent qsub failures
> > that we see several times a day on our cluster. We were using 4.1.6,
> > and a reply here indicated that this was a known problem that should
> > be fixed in the (then-forthcoming) 4.2.7 release.
> > Today I had a chance to build the new packages and apply them to all
> > of the nodes as well as the server. (That went very well, by the way.
> > I stopped the queues, let them drain out, refreshed and restarted
> > everything, and the jobs started releasing again. Very nice!)
> > However, I still got a couple of these qsub failures in a batch of
> > 700+ jobs.
> > Exit code = 196
> > Error: qsub: submit error (Invalid request MSG=cannot locate new job
> > 1320429.h2 (0 - Success))
> > So, do others see this? Am I missing some other configuration detail?
> > -craig
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers