[torquedev] [torqueusers] Time out (15082) in send_job
Joshua Bernstein
jbernstein at penguincomputing.com
Tue Aug 24 13:57:39 MDT 2010
Ken Nielson wrote:
> On 08/24/2010 12:18 AM, Josh Bernstein wrote:
>> This is with Torque 2.3.10 and Moab 5.4.
>>
>> -Josh
>>
>> On Aug 23, 2010, at 10:53 PM, Ken Nielson
>> <knielson at adaptivecomputing.com> wrote:
>>
>>
>>> Is this TORQUE 5.4?
>>>
>>> Ken
>>>
>>> ----- Original Message -----
>>> From: "Joshua Bernstein" <jbernstein at penguincomputing.com>
>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>> Sent: Monday, August 23, 2010 1:06:50 PM
>>> Subject: [torqueusers] Time out (15082) in send_job
>>>
>>> Hello Folks,
>>>
>>> I'm seeing a ton of timeouts in send_job as shown by the log
>>> errors from pbs_server below. According to the published list of
>>> error codes, error 15082 isn't defined:
>>>
>>> http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml
>>>
>>>
>>> PBSPro suggests that this error is a "batch request generation
>>> failed". In fact, in this PDF I dug up, there are a host of other
>>> codes in here as well. Since TORQUE seems to use some of these,
>>> perhaps they should be added to the TORQUE docs?
>>>
>>> https://secure.altair.com/docs/PBSproAG_53.pdf (around page 215)
>>>
>>> Any thoughts on what could be going on here? Any ways to work
>>> around it? Perhaps this error code should be added to the docs?
>>>
>>> The logs I reference above are shown here:
>>>
>>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out (15082) in send_job, child failed in previous commit request for job 2316886.scyld.localdomain
>>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out (15082) in send_job, child failed in previous commit request for job 2316890.scyld.localdomain
>>>
>>> -Joshua Bernstein Penguin Computing
>>
> Josh,
>
> Under TORQUE 2.3.10, 15082 is defined to be PBSE_TIMEOUT. I checked
> src/include/pbs_error.h and I referenced the URL you posted. They are
> the same in both places.
> http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml
Wow. Did I miss something or did you update this?
> Looking in send_job it appears the call to svr_connect gets the
> PBSE_TIMEOUT error. Do you have more log information? It would be
> helpful to know who is calling send_job. It is either svr_strtjob2
> (most likely) or net_move.
From what I can see, it's svr_strtjob2(). Is there a server parameter
that I can set that adjusts the TIMEOUT? Thanks for your help Ken.
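
In case it helps with gathering more log information, here is a minimal
Python sketch for pulling the affected job IDs out of the pbs_server log
and counting timeouts per minute. The regular expression assumes every
entry has the same layout as the two lines quoted above; the function
names are made up for illustration and are not part of TORQUE.

```python
import re
from collections import Counter

# Layout assumed from the two pbs_server entries quoted in this thread:
# "MM/DD/YYYY HH:MM:SS;...;LOG_ERROR::Time out (15082) in send_job, ... for job <id>"
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});"
    r".*Time out \(15082\) in send_job.*for job (?P<job>\S+)"
)

# The two entries from the original report, one per line.
SAMPLE = [
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316886.scyld.localdomain",
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316890.scyld.localdomain",
]


def find_send_job_timeouts(lines):
    """Return (timestamp, job_id) pairs for each send_job timeout entry."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((match.group("stamp"), match.group("job")))
    return hits


def timeouts_per_minute(hits):
    """Count timeouts per minute (timestamp truncated to MM/DD/YYYY HH:MM)."""
    return Counter(stamp[:16] for stamp, _ in hits)


if __name__ == "__main__":
    found = find_send_job_timeouts(SAMPLE)
    for stamp, job in found:
        print(stamp, job)
    for minute, count in sorted(timeouts_per_minute(found).items()):
        print(minute, count)
```

To run it against a real log, pass an open file object in place of
SAMPLE, since the function only needs an iterable of lines.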
-Joshua Bernstein
Penguin Computing