[torquedev] [torqueusers] Time out (15082) in send_job

Ken Nielson knielson at adaptivecomputing.com
Tue Aug 24 09:17:23 MDT 2010


On 08/24/2010 12:18 AM, Josh Bernstein wrote:
> This is with Torque 2.3.10 and Moab 5.4.
>
> -Josh
>
> On Aug 23, 2010, at 10:53 PM, Ken Nielson
> <knielson at adaptivecomputing.com>  wrote:
>
>    
>> Is this TORQUE 5.4?
>>
>> Ken
>>
>> ----- Original Message -----
>> From: "Joshua Bernstein"<jbernstein at penguincomputing.com>
>> To: "Torque Users Mailing List"<torqueusers at supercluster.org>
>> Sent: Monday, August 23, 2010 1:06:50 PM
>> Subject: [torqueusers] Time out (15082) in send_job
>>
>> Hello Folks,
>>
>> I'm seeing a ton of timeouts in send_job as shown by the log errors from
>> pbs_server below. According to the published list of error codes, error
>> 15082 isn't defined:
>>
>> http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml
>>
>> PBSPro suggests that this error is a "batch request generation failed".
>> In fact, in this PDF I dug up, there are a host of other codes in here
>> as well. Since TORQUE seems to use some of these, perhaps they should be
>> added to the TORQUE docs?
>>
>> https://secure.altair.com/docs/PBSproAG_53.pdf (around page 215)
>>
>> Any thoughts on what could be going on here? Any ways to work around it?
>> Perhaps this error code should be added to the docs?
>>
>> The logs I reference above are shown here:
>>
>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out
>> (15082) in send_job, child failed in previous commit request for job
>> 2316886.scyld.localdomain
>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out
>> (15082) in send_job, child failed in previous commit request for job
>> 2316890.scyld.localdomain
>>
>> -Joshua Bernstein
>> Penguin Computing
>
Josh,

Under TORQUE 2.3.10, 15082 is defined as PBSE_TIMEOUT. I checked
src/include/pbs_error.h and the URL you posted, and the definition is
the same in both places:
http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml

Looking at send_job, it appears the call to svr_connect is what produces
the PBSE_TIMEOUT error. Do you have more log information? It would be
helpful to know what is calling send_job. It is either svr_strtjob2
(most likely) or net_move.
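
If it helps with gathering that log information, here is a minimal Python
sketch for pulling the affected job IDs out of the pbs_server log. The
pattern is inferred only from the two log lines quoted above, not from any
TORQUE specification, and the function name find_send_job_errors is
hypothetical. It does not identify the caller of send_job; it only collects
the timestamps and job IDs so the surrounding entries for those jobs can be
looked up.

```python
import re
from collections import Counter

# Pattern modeled on the two pbs_server log lines quoted in this thread.
# The field layout is an assumption based on those samples alone.
SEND_JOB_ERROR = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});"
    r".*\((?P<code>\d+)\) in send_job,.* for job (?P<job>\S+)$"
)


def find_send_job_errors(lines, code="15082"):
    """Return (timestamp, job_id) pairs for send_job errors with the given code."""
    hits = []
    for line in lines:
        match = SEND_JOB_ERROR.match(line.strip())
        if match and match.group("code") == code:
            hits.append((match.group("stamp"), match.group("job")))
    return hits


# The two entries from the original report, unwrapped to one line each.
sample = [
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316886.scyld.localdomain",
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316890.scyld.localdomain",
]

hits = find_send_job_errors(sample)

# Count errors per minute to see whether the timeouts arrive in bursts.
per_minute = Counter(stamp[:16] for stamp, _ in hits)

for stamp, job in hits:
    print(stamp, job)
```

To run it against a real log, pass an open file object in place of sample.
Searching the same log for each job ID returned should show the entries
logged around the failure.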

Thanks

Ken




