[torqueusers] intermittent qsub failures with 4.2.7

Matthew Britt msbritt at umich.edu
Fri Mar 21 16:17:56 MDT 2014


Hi Craig - I don't think that is the exact error message we saw, but we
did have an issue with rapid qsub failures. Since we started using
'clientretry', we have not had a problem with this (failed submissions
are simply retried, so no attempted submission ultimately fails):

http://docs.adaptivecomputing.com/torque/4-2-6/help.htm#topics/12-appendices/torque.cfgConfigFile.htm?Highlight=clientretry
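
For anyone who would rather handle retries in the submission script
itself, the same idea can be sketched as a small wrapper. This is only
an illustration and is separate from the torque.cfg setting above; the
function name, the retry count, the delay, and the job script name
job.sh are all assumptions, not anything TORQUE provides.

```python
import subprocess
import time


def submit_with_retry(cmd, retries=3, delay=1.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying while it exits non-zero.

    Hypothetical helper: tries the command up to `retries` times
    (retries must be >= 1), pausing `delay` seconds between attempts,
    and returns the result of the last attempt.
    """
    result = None
    for attempt in range(1, retries + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result          # submission accepted
        if attempt < retries:
            sleep(delay)           # back off before the next attempt
    return result                  # still failing after all attempts


# Example use (job.sh is a placeholder script name):
#   result = submit_with_retry(["qsub", "job.sh"])
#   if result.returncode != 0:
#       print(result.stderr)
```

The `run` and `sleep` parameters exist only so the helper can be
exercised without a real pbs_server; in normal use the defaults apply.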

In case that helps.

    - Matt

--------------------------------------------
Matthew Britt
CAEN HPC Group - College of Engineering
msbritt at umich.edu


On 21 Mar 2014, at 17:33, Craig Artley wrote:

> Hello, last November I inquired here about intermittent qsub failures 
> that we see several times a day on our cluster. We were using 4.1.6, 
> and a reply here indicated that this was a known problem that should 
> be fixed in the (then-forthcoming) 4.2.7 release.
>
> Today I had a chance to build the new packages and apply them to all 
> of the nodes as well as the server. (That went very well, by the way. 
> I stopped the queues, let them drain out, refreshed and restarted 
> everything, and the jobs started releasing again. Very nice!)
>
> However, I still got a couple of these qsub failures in a batch of 
> 700+ jobs.
>
> Exit code = 196
> Error: qsub: submit error (Invalid request MSG=cannot locate new job 
> 1320429.h2 (0 - Success))
>
> So, do others see this? Am I missing some other configuration detail?
>
> -craig
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

