[torqueusers] PBS unable to execute Job.

Garrick Staples garrick at usc.edu
Sun Sep 11 16:30:44 MDT 2005


Dave, we should have a "startup sanity check" that warns about these kinds
of conditions when MOM first launches.
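
To make the idea concrete, here is a rough sketch of what such a check could look like. It is Python for brevity, not a patch against MOM. Only the 'aux' directory under /usr/spool/PBS is confirmed by this thread; the other directory names and the function name check_spool_dirs are assumptions for illustration.

```python
import os

# Hypothetical list: only "aux" is confirmed by this thread. The other
# names are assumptions about a typical MOM spool layout.
REQUIRED_SUBDIRS = ("aux", "mom_priv", "mom_logs", "spool", "undelivered")


def check_spool_dirs(pbs_home, subdirs=REQUIRED_SUBDIRS):
    """Return one warning string per missing or unwritable directory."""
    warnings = []
    for name in subdirs:
        path = os.path.join(pbs_home, name)
        if not os.path.isdir(path):
            warnings.append("missing directory: " + path)
        elif not os.access(path, os.W_OK | os.X_OK):
            warnings.append("directory not writable: " + path)
    return warnings


if __name__ == "__main__":
    # /usr/spool/PBS is the spool location that appears in this thread.
    for warning in check_spool_dirs("/usr/spool/PBS"):
        print("WARNING:", warning)
```

Run at startup, something like this would have flagged the missing aux directory before any job was launched, instead of leaving it to the child process to discover.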


Ashley, sorry that was so hard to find, but I'm glad it's working for
you.
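
For anyone searching the archives later: the child's error ended up in syslog, not in the mom log. Below is a small sketch for pulling the useful fields out of such a line. The regular expression is fitted only to the single syslog line quoted below, so treat the pattern and the name parse_mom_syslog as assumptions; other pbs_mom messages may not match it.

```python
import re

# Pattern fitted to the one pbs_mom syslog line quoted in this thread.
# It is an assumption that other messages share this shape.
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) pbs_mom: "
    r"(?P<error>.+?) \((?P<errno>\d+)\) in (?P<func>\w+), (?P<detail>.*)$"
)


def parse_mom_syslog(line):
    """Return a dict of fields for a matching line, or None otherwise."""
    match = SYSLOG_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in "
        "TMomFinalizeChild, cannot open "
        "/usr/spool/PBS/aux/941.auriga.qut.edu.au"
    )
    fields = parse_mom_syslog(sample)
    print(fields["func"], fields["errno"], fields["detail"])
```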


On Mon, Sep 12, 2005 at 08:13:18AM +1000, Ashley Wright alleged:
> Thanks Garrick,
> 
> I checked syslog and found the error message below:
> Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in 
> TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au
> 
> So I have created the directory '/usr/spool/PBS/aux' and I can now 
> submit jobs.
> I did not think to look in syslog, as I thought all the errors 
> were logged to the mom logs.
> 
> Thanks,
> Ashley
> 
> Garrick Staples wrote:
> 
> >On Thu, Sep 08, 2005 at 12:39:15PM +1000, Ashley Wright alleged:
> > 
> >
> >>Thanks Chris,
> >>
> >>I have increased the loglevel to 3, and one of the messages I get is:
> >>
> >>09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;phase 2 of 
> >>job launch successfully completed
> >>09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;read start 
> >>return code=-1 session=127
> >>09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;job not 
> >>started, Failure job exec failure, before files staged, no retry
> >>09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;ALERT:  
> >>job failed phase 3 start, server will retry
> >>09/08/2005 12:33:50;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> >>sisters
> >>
> >>What is 'phase 3'?  It seems to say this is before the files are staged.
> >>   
> >>
> >phase 2 launches the child process that will eventually become the job.
> >phase 3 is MOM reading a status code from the child telling it if the
> >child was successful.
> >
> >The fact that the parent got a -1 means that the child caught an error
> >and exited.  Unfortunately it is really hard to debug problems in the
> >child process, partly because it can't write to the mom log.
> >
> >Did you configure torque with --enable-syslog?  If so, the child should
> >syslog any errors.
> >
> >
> > 
> >
> >>A little further on, it seems like the files are copied and the job is 
> >>forked:
> >>
> >>09/08/2005 12:33:50;0100;   pbs_mom;Req;;Type CopyFiles request received 
> >>from PBS_Server at mgt, sock=10
> >>09/08/2005 12:33:50;0008;   pbs_mom;Job;process_request;request type 
> >>CopyFiles from host mgt allowed
> >>09/08/2005 12:33:50;0004;   pbs_mom;Fil;914.auriga.qut.edu.au;forking to 
> >>user, uid: 1001  gid: 100  homedir: '/home/wright4'
> >>   
> >>
> >
> >Hrm, I don't think that should be happening after the child has failed.
> >
> > 
> >
> >------------------------------------------------------------------------
> >
> >_______________________________________________
> >torqueusers mailing list
> >torqueusers at supercluster.org
> >http://www.supercluster.org/mailman/listinfo/torqueusers
> > 
> >
> 
> 
> -- 
> Ashley Wright
> 3864 9264
> a2.wright at qut.edu.au
> HPC and Research Support Group
> Queensland University of Technology (QUT)
> 
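The phase 2 / phase 3 handshake described in the quoted text above can be sketched roughly as follows. This is an illustration in Python of the general fork-and-report-over-a-pipe pattern, not MOM's actual code: the function name launch, the pipe, and the packed-integer encoding are all assumptions made for the example.

```python
import os
import struct


def launch(child_setup):
    """Fork a child, run child_setup() in it, and return the status code
    the child reports over a pipe (0 = ok, -1 = setup failed)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child ("phase 2" created it): do the setup, then report back.
        # It has no access to the parent's log; the pipe is its only voice.
        os.close(read_fd)
        try:
            child_setup()
            code = 0
        except Exception:
            code = -1
        os.write(write_fd, struct.pack("i", code))
        os._exit(0 if code == 0 else 1)
    # Parent ("phase 3"): read the status code the child sent.
    os.close(write_fd)
    size = struct.calcsize("i")
    data = os.read(read_fd, size)
    os.close(read_fd)
    os.waitpid(pid, 0)
    if len(data) != size:
        return -1  # child died before reporting anything
    return struct.unpack("i", data)[0]


if __name__ == "__main__":
    def open_missing():
        # Fails the same way as the missing aux directory in this thread.
        open("/nonexistent-dir/aux/941.example", "w")

    print("return code =", launch(open_missing))
```

The parent only ever sees the number, which is why the -1 in the mom log says nothing about the cause; the detail has to come from wherever the child itself can log, such as syslog.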

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California