[torquedev] [torqueusers] Time out (15082) in send_job

Al Taufer ataufer at adaptivecomputing.com
Tue Aug 24 14:04:48 MDT 2010


----- Original Message -----
> Ken Nielson wrote:
> > On 08/24/2010 12:18 AM, Josh Bernstein wrote:
> >> This is with Torque 2.3.10 and Moab 5.4.
> >>
> >> -Josh
> >>
> >> On Aug 23, 2010, at 10:53 PM, Ken Nielson
> >> <knielson at adaptivecomputing.com> wrote:
> >>
> >>
> >>> Is this TORQUE 5.4?
> >>>
> >>> Ken
> >>>
> >>> ----- Original Message -----
> >>> From: "Joshua Bernstein" <jbernstein at penguincomputing.com>
> >>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >>> Sent: Monday, August 23, 2010 1:06:50 PM
> >>> Subject: [torqueusers] Time out (15082) in send_job
> >>>
> >>> Hello Folks,
> >>>
> >>> I'm seeing a ton of timeouts in send_job as shown by the log
> >>> errors from pbs_server below. According to the published list of
> >>> error codes, error 15082 isn't defined:
> >>>
> >>> http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml
> >>>
> >>>
> >>> PBSPro suggests that this error is a "batch request generation
> >>> failed". In fact, in this PDF I dug up, there are a host of other
> >>> codes as well. Since TORQUE seems to use some of these,
> >>> perhaps they should be added to the TORQUE docs?
> >>>
> >>> https://secure.altair.com/docs/PBSproAG_53.pdf (around page 215)
> >>>
> >>> Any thoughts on what could be going on here? Any ways to work
> >>> around it? Perhaps this error code should be added to the docs?
> >>>
> >>> The logs I reference above are shown here:
> >>>
> >>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out (15082) in send_job, child failed in previous commit request for job 2316886.scyld.localdomain
> >>> 08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out (15082) in send_job, child failed in previous commit request for job 2316890.scyld.localdomain
> >>>
> >>> -Joshua Bernstein
> >>> Penguin Computing
> >>
> > Josh,
> >
> > Under TORQUE 2.3.10, 15082 is defined to be PBSE_TIMEOUT. I checked
> > src/include/pbs_error.h and I referenced the URL you posted. They
> > are the same in both places.
> > http://www.clusterresources.com/products/torque/docs/a.derrorcodes.shtml
> 
> Wow. Did I miss something or did you update this?

You didn't miss it.  It got updated yesterday afternoon.

> 
> > Looking in send_job it appears the call to svr_connect gets the
> > PBSE_TIMEOUT error. Do you have more log information? It would be
> > helpful to know who is calling send_job. It is either svr_strtjob2
> > (most likely) or net_move.
> 
> From what I can see it's svr_strtjob2(). Is there a server parameter
> that I can set that adjusts the TIMEOUT? Thanks for your help Ken.
> 
> -Joshua Bernstein
> Penguin Computing

Al Taufer
Adaptive Computing
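
For anyone triaging the same messages, here is a minimal sketch of pulling the affected job IDs out of a pbs_server log. It is not part of TORQUE: the function name parse_timeouts and the regular expression are hypothetical, and the pattern is modeled only on the two log lines quoted in this thread, so other message layouts may not match.

```python
import re
from collections import Counter

# Pattern modeled on the two pbs_server lines quoted in this thread.
# Assumption: date and time lead the line, the numeric error code sits in
# parentheses before "in <function>,", and the job ID ends the line.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2});"
    r".*?\((?P<code>\d+)\) in (?P<func>\w+),"
    r".*?for job (?P<job>\S+)\s*$"
)


def parse_timeouts(lines, code="15082"):
    """Return (date, time, function, job_id) for each line carrying `code`."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("code") == code:
            hits.append(
                (m.group("date"), m.group("time"), m.group("func"), m.group("job"))
            )
    return hits


# The two log entries from the original report, one entry per line.
SAMPLE = [
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316886.scyld.localdomain",
    "08/23/2010 06:26:00;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Time out "
    "(15082) in send_job, child failed in previous commit request for job "
    "2316890.scyld.localdomain",
]

if __name__ == "__main__":
    hits = parse_timeouts(SAMPLE)
    # Count timeouts per job to see whether a few jobs account for most errors.
    per_job = Counter(job for _, _, _, job in hits)
    for job, count in per_job.items():
        print(job, count)
```

To run it against a real log, pass the open file object in place of SAMPLE; the function only iterates over lines.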



