[torqueusers] intermittent qsub failures with 4.2.7

Matthew Britt msbritt at umich.edu
Mon Mar 24 08:46:11 MDT 2014


We have CLIENTRETRY set to 60 to be safe (and are using the default of 0 
for QSUBSLEEP), although in my testing I don't believe I've ever seen a 
single qsub need more than one retry.
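
For anyone who can't change torque.cfg, the same "just retry it" idea can 
be approximated on the submitting side. This is only a rough sketch and 
not how CLIENTRETRY works internally: the function name 
submit_with_retry and the attempts/delay parameters are made up for 
illustration. The only things it assumes about qsub are that it exits 
non-zero on failure and prints the job id on stdout on success. Note that 
if the server did accept the job before reporting the error, a blind 
resubmit may create a duplicate, so check qstat if that matters.

```python
import subprocess
import time


def submit_with_retry(script, attempts=3, delay=1.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run `qsub script`, retrying when qsub exits with a non-zero status.

    Returns the job id that qsub printed on success. Raises RuntimeError
    once `attempts` tries have all failed. `run` and `sleep` are
    injectable (hypothetical hooks) so the logic can be tested without a
    real cluster.
    """
    last_error = ""
    for attempt in range(1, attempts + 1):
        result = run(["qsub", script], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
        last_error = result.stderr.strip()
        # Pause before the next try, but not after the final failure.
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(
        f"qsub failed after {attempts} attempts: {last_error}"
    )
```

Usage would be something like job_id = submit_with_retry("job.sh") in 
whatever script generates the batch of submissions.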

  - Matt


--------------------------------------------
Matthew Britt
CAEN HPC Group - College of Engineering
msbritt at umich.edu


On 21 Mar 2014, at 18:26, Craig Artley wrote:

> Matt, thanks for the idea. What value are you using? Looks like I have 
> 5 for clientretry, and 0 for qsubsleep.
>
> QSUBSLEEP     0
> CLIENTRETRY   5
>
> -craig
>
>> From: msbritt at umich.edu
>> To: torqueusers at supercluster.org
>> Date: Fri, 21 Mar 2014 18:17:56 -0400
>> Subject: Re: [torqueusers] intermittent qsub failures with 4.2.7
>>
>> Hi Craig - I don't believe that was the error syntax I remember, but we
>> did have an issue w/ rapid qsub failures. Since we started using
>> 'clientretry', we no longer have a problem with this (failures are just
>> retried, so no attempted submission fails):
>>
>> http://docs.adaptivecomputing.com/torque/4-2-6/help.htm#topics/12-appendices/torque.cfgConfigFile.htm?Highlight=clientretry
>>
>> In case that helps....
>>
>>  - Matt
>>
>> --------------------------------------------
>> Matthew Britt
>> CAEN HPC Group - College of Engineering
>> msbritt at umich.edu
>>
>>
>> On 21 Mar 2014, at 17:33, Craig Artley wrote:
>>
>>> Hello, last November I inquired here about intermittent qsub failures
>>> that we see several times a day on our cluster. We were using 4.1.6,
>>> and a reply here indicated that this was a known problem that should
>>> be fixed in the (then-forthcoming) 4.2.7 release.
>>>
>>> Today I had a chance to build the new packages and apply them to all
>>> of the nodes as well as the server. (That went very well, by the way.
>>> I stopped the queues, let them drain out, refreshed and restarted
>>> everything, and the jobs started releasing again. Very nice!)
>>>
>>> However, I still got a couple of these qsub failures in a batch of
>>> 700+ jobs.
>>>
>>> Exit code = 196
>>> Error: qsub: submit error (Invalid request MSG=cannot locate new job
>>> 1320429.h2 (0 - Success))
>>>
>>> So, do others see this? Am I missing some other configuration detail?
>>>
>>> -craig
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers

