[torquedev] [torqueusers] Time out (15082) in send_job

Joshua Bernstein jbernstein at penguincomputing.com
Tue Aug 24 13:57:39 MDT 2010



Ken Nielson wrote:
> On 08/24/2010 12:18 AM, Josh Bernstein wrote:
>> This is with Torque 2.3.10 and Moab 5.4.
>> 
>> -Josh
>> 
>> On Aug 23, 2010, at 10:53 PM, Ken Nielson 
>> <knielson at adaptivecomputing.com>  wrote:
>> 
>> 
>>> Is this TORQUE 5.4?
>>> 
>>> Ken
>>> 
>>> ----- Original Message -----
>>> From: "Joshua Bernstein" <jbernstein at penguincomputing.com>
>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>> Sent: Monday, August 23, 2010 1:06:50 PM
>>> Subject: [torqueusers] Time out (15082) in send_job
>>> 
>>> Hello Folks,
>>> 
>>> I'm seeing a ton of timeouts in send_job as shown by the log
>>> errors from pbs_server below. According to the published list of
>>> error codes, error 15082 isn't defined:
>>> 
>>> http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml
>>> 
>>> 
>>> PBSPro suggests that this error is a "batch request generation
>>> failed". In fact, in this PDF I dug up, there are a host of other
>>> codes as well. Since TORQUE seems to use some of these,
>>> perhaps they should be added to the TORQUE docs?
>>> 
>>> https://secure.altair.com/docs/PBSproAG_53.pdf (around page 215)
>>> 
>>> Any thoughts on what could be going on here? Any ways to work
>>> around it? Perhaps this error code should be added to the docs?
>>> 
>>> The logs I reference above are shown here:
>>> 
>>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out (15082) in send_job, child failed in previous commit request for job 2316886.scyld.localdomain
>>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out (15082) in send_job, child failed in previous commit request for job 2316890.scyld.localdomain
>>> 
>>> -Joshua Bernstein Penguin Computing
>> 
> Josh,
> 
> Under TORQUE 2.3.10, 15082 is defined as PBSE_TIMEOUT. I checked
> src/include/pbs_error.h against the URL you posted, and the
> definition is the same in both places.
> http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml

Wow. Did I miss something or did you update this?

> Looking in send_job it appears the call to svr_connect gets the 
> PBSE_TIMEOUT error.  Do you have more log information? It would be
> helpful to know who is calling send_job. It is either svr_strtjob2
> (most likely) or net_move.

From what I can see, it's svr_strtjob2(). Is there a server parameter
that I can set to adjust the timeout? Thanks for your help, Ken.
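
For anyone gathering more log detail on this error, here is a minimal Python sketch that pulls the send_job timeouts out of a pbs_server log and counts them per minute. The regular expression is modeled only on the two log lines quoted in this thread, so the field layout is an assumption for other TORQUE versions, and the function names (find_send_job_timeouts, timeouts_per_minute) are hypothetical helpers, not part of TORQUE.

```python
import re
from collections import Counter

# Pattern modeled on the two log lines quoted in this thread; other
# TORQUE versions may format these records differently (assumption).
TIMEOUT_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2});"
    r".*?\((?P<code>\d+)\) in send_job.*?for job (?P<job>\S+)"
)


def find_send_job_timeouts(lines, code="15082"):
    """Return (timestamp, job_id) pairs for send_job errors with the given code."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m and m.group("code") == code:
            hits.append((f"{m.group('date')} {m.group('time')}", m.group("job")))
    return hits


def timeouts_per_minute(hits):
    """Count hits per minute to show whether the timeouts arrive in bursts."""
    # The first 16 characters of the timestamp are "MM/DD/YYYY HH:MM".
    return Counter(ts[:16] for ts, _ in hits)


# The two records from the original report, one per line.
sample = [
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316886.scyld.localdomain",
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316890.scyld.localdomain",
]

hits = find_send_job_timeouts(sample)
for timestamp, job_id in hits:
    print(timestamp, job_id)
print(dict(timeouts_per_minute(hits)))
```

To run it against a real log, pass an open file object in place of sample; each record must sit on a single line for the pattern to match.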

-Joshua Bernstein
Penguin Computing

