[torqueusers] PBS unable to execute Job.

Garrick Staples garrick at usc.edu
Fri Sep 9 16:01:52 MDT 2005

On Thu, Sep 08, 2005 at 12:39:15PM +1000, Ashley Wright alleged:
> Thanks Chris,
> I have increased the loglevel to 3. One of the messages I get is:
> 09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;phase 2 of 
> job launch successfully completed
> 09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;read start 
> return code=-1 session=127
> 09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;job not 
> started, Failure job exec failure, before files staged, no retry
> 09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;ALERT:  
> job failed phase 3 start, server will retry
> 09/08/2005 12:33:50;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> What is 'phase 3'?  It seems to say this is before the files are staged.
Phase 2 launches the child process that will eventually become the job.
Phase 3 is MOM reading a status code from the child that tells it whether
the child was successful.
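
As a rough illustration of that handshake (this is not TORQUE's actual code, only a minimal Python sketch), a parent can fork a child and read a single status integer back over a pipe. The name launch_with_status and the 0 / -1 codes are made up for this example.

```python
import os
import struct

STATUS_FMT = "i"  # one native int, an assumption made for this sketch


def launch_with_status(child_setup):
    """Fork a child, let it run child_setup(), and return the status code
    the child reported over a pipe (None if it reported nothing)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the setup work, report a code to the parent, then exit.
        exit_status = 1
        try:
            os.close(read_fd)
            try:
                child_setup()
                code = 0
            except Exception:
                code = -1  # the child caught an error
            os.write(write_fd, struct.pack(STATUS_FMT, code))
            exit_status = 0 if code == 0 else 1
        finally:
            os._exit(exit_status)

    # Parent: read the status code the child sent, then reap the child.
    os.close(write_fd)
    size = struct.calcsize(STATUS_FMT)
    data = os.read(read_fd, size)
    os.close(read_fd)
    os.waitpid(pid, 0)
    if len(data) < size:
        return None  # child died before reporting anything
    return struct.unpack(STATUS_FMT, data)[0]
```

A setup function that raises makes the parent see -1, which is roughly the situation in the log above: the parent only learns that the child failed, not why.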

The fact that the parent got a -1 means that the child caught an error
and exited.  Unfortunately it is really hard to debug problems in the
child process, partly because it can't write to the mom log.

Did you configure torque with --enable-syslog?  If so, the child should
syslog any errors.
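
For what it's worth, here is a minimal Python sketch of the general idea: a process with no access to the daemon's own log file reports a failure through syslog instead. It does not show what pbs_mom itself logs; the ident string "job-child-sketch" and the message format are assumptions made for illustration.

```python
import syslog


def describe_failure(step, err):
    """Build a one-line description of a failed setup step."""
    return f"{step} failed: {err.strerror} (errno {err.errno})"


def report_failure(step, err):
    """Send the failure description to the system log at error priority."""
    # The ident "job-child-sketch" is hypothetical, chosen for this example.
    syslog.openlog("job-child-sketch", syslog.LOG_PID, syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_ERR, describe_failure(step, err))
    syslog.closelog()
```

Where the message ends up (for example /var/log/messages) depends on the local syslog configuration.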

> A little further on, it seems like the files are copied and the job is 
> forked:
> 09/08/2005 12:33:50;0100;   pbs_mom;Req;;Type CopyFiles request received 
> from PBS_Server at mgt, sock=10
> 09/08/2005 12:33:50;0008;   pbs_mom;Job;process_request;request type 
> CopyFiles from host mgt allowed
> 09/08/2005 12:33:50;0004;   pbs_mom;Fil;914.auriga.qut.edu.au;forking to 
> user, uid: 1001  gid: 100  homedir: '/home/wright4'

Hrm, I don't think that should be happening after the child has failed.

Garrick Staples, Linux/HPCC Administrator
University of Southern California