[torqueusers] PBS unable to execute Job.

Ashley Wright a2.wright at qut.edu.au
Sun Sep 11 16:13:18 MDT 2005


Thanks Garrick,

I checked syslog and found the error message below:
Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in 
TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au
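
For anyone searching for similar entries, here is a minimal sketch of filtering pbs_mom lines out of a syslog file. It assumes the classic "Mon DD HH:MM:SS host prog: message" layout shown above; the function name mom_entries and the second sample line are made up for illustration, and the syslog file location differs between systems.

```python
import re

# Classic syslog layout: "Sep 12 08:05:21 host prog[pid]: message" (assumed).
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<prog>[^\s:\[]+)(?:\[\d+\])?:\s*(?P<msg>.*)$"
)


def mom_entries(lines, prog="pbs_mom"):
    """Return (timestamp, host, message) for each line logged by `prog`."""
    found = []
    for line in lines:
        match = SYSLOG_RE.match(line.rstrip("\n"))
        if match and match.group("prog") == prog:
            found.append(
                (match.group("stamp"), match.group("host"), match.group("msg"))
            )
    return found


# First line is the entry from this message (unwrapped); second is invented.
SAMPLE = [
    "Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in "
    "TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au",
    "Sep 12 08:05:22 node010 sshd[1234]: hypothetical unrelated entry",
]

for stamp, host, msg in mom_entries(SAMPLE):
    print(stamp, host, msg)
```

On a real node the lines would come from the syslog file itself, for example `mom_entries(open(path))` with whatever path the local syslog configuration uses.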

So I have created the directory '/usr/spool/PBS/aux' and I can now 
submit jobs.
I did not think to look in the syslog logs, as I thought all the errors 
were logged to the mom logs.
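
In case it helps someone else, the check I effectively did by hand can be sketched as below. The spool path comes from the error message; the helper name ensure_aux_dir and the 0755 mode are my assumptions, so compare the permissions with a working node before relying on them.

```python
import os
import tempfile


def ensure_aux_dir(spool_dir):
    """Create <spool_dir>/aux if it is missing.

    Returns True if the directory had to be created, False if it existed.
    The 0o755 mode is an assumption, not taken from the TORQUE docs.
    """
    aux = os.path.join(spool_dir, "aux")
    if os.path.isdir(aux):
        return False
    os.makedirs(aux, mode=0o755)
    return True


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of the real spool tree.
    with tempfile.TemporaryDirectory() as spool:
        print(ensure_aux_dir(spool))
        print(ensure_aux_dir(spool))
```

On the compute node itself this would be run as root against the real spool directory, i.e. `ensure_aux_dir("/usr/spool/PBS")` for the path in the error above.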

Thanks,
Ashley

Garrick Staples wrote:

>On Thu, Sep 08, 2005 at 12:39:15PM +1000, Ashley Wright alleged:
>
>>Thanks Chris,
>>
>>I have increased the loglevel to 3. One of the messages I get is:
>>
>>09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;phase 2 of 
>>job launch successfully completed
>>09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;read start 
>>return code=-1 session=127
>>09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;job not 
>>started, Failure job exec failure, before files staged, no retry
>>09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;ALERT:  
>>job failed phase 3 start, server will retry
>>09/08/2005 12:33:50;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>>sisters
>>
>>What is 'phase 3'?  It seems to say this is before the files are staged.
>>
>phase 2 launches the child process that will eventually become the job.
>phase 3 is MOM reading a status code from the child telling it if the
>child was successful.
>
>The fact that the parent got a -1 means that the child caught an error
>and exited.  Unfortunately it is really hard to debug problems in the
>child process, partly because it can't write to the mom log.
>
>Did you configure torque with --enable-syslog?  If so, the child should
>syslog any errors.
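
To make the phase 2 / phase 3 description concrete, here is a rough sketch of the general pattern: a parent forks a child, the child reports a status code over a pipe, and the child logs its own errors to syslog because it cannot write to the parent's log. This is not TORQUE's code (pbs_mom is written in C); the names launch and job_setup are hypothetical, and the real child goes on to become the job rather than exiting.

```python
import os
import struct
import syslog

STATUS_FMT = "i"  # one native int carries the child's status code


def launch(job_setup):
    """Fork a child, run job_setup() in it, and return the code it reports."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child ("phase 2"): do the setup, then tell the parent how it went.
        code = -1
        try:
            os.close(read_fd)
            job_setup()
            code = 0
        except Exception as exc:
            # The child has no access to the parent's log, so use syslog.
            syslog.syslog(syslog.LOG_ERR, "child setup failed: %s" % exc)
        finally:
            os.write(write_fd, struct.pack(STATUS_FMT, code))
            os._exit(0 if code == 0 else 1)
    # Parent ("phase 3"): read the status code the child wrote.
    os.close(write_fd)
    data = os.read(read_fd, struct.calcsize(STATUS_FMT))
    os.close(read_fd)
    os.waitpid(pid, 0)
    if len(data) != struct.calcsize(STATUS_FMT):
        return -1  # child died before reporting anything
    return struct.unpack(STATUS_FMT, data)[0]


def broken_setup():
    # Mimics the failure in this thread: opening a file in a missing directory.
    open("/nonexistent-dir-for-demo/aux/941.example", "w")


if __name__ == "__main__":
    print(launch(lambda: None))
    print(launch(broken_setup))
```

The parent only ever sees the bare status code, which is why the reason for the failure has to be looked up in syslog rather than in the parent's own log.
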
>
>>A little further on it seems like the files are copied and the job is 
>>forked:
>>
>>09/08/2005 12:33:50;0100;   pbs_mom;Req;;Type CopyFiles request received 
>>from PBS_Server at mgt, sock=10
>>09/08/2005 12:33:50;0008;   pbs_mom;Job;process_request;request type 
>>CopyFiles from host mgt allowed
>>09/08/2005 12:33:50;0004;   pbs_mom;Fil;914.auriga.qut.edu.au;forking to 
>>user, uid: 1001  gid: 100  homedir: '/home/wright4'
>>
>
>Hrm, I don't think that should be happening after the child has failed.


-- 
Ashley Wright
3864 9264
a2.wright at qut.edu.au
HPC and Research Support Group
Queensland University of Technology (QUT)


