[torqueusers] PBS unable to execute Job.
a2.wright at qut.edu.au
Sun Sep 11 16:13:18 MDT 2005
I checked syslog and found the error message below:
Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in
TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au
So I have created the directory '/usr/spool/PBS/aux' and I can now
I did not think to look in the syslog logs, as I thought all the errors
were logged to the mom logs.
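
For anyone who hits the same "cannot open" message, here is a minimal
sketch of how the missing directory could be pinpointed from the path in
the syslog line. It is my own illustration, not part of TORQUE; the
function name first_missing_dir is hypothetical, and it assumes the path
is checked on the node that logged the error.

```python
import os


def first_missing_dir(path):
    """Return the topmost directory missing above `path`, or None.

    Only the directories leading to `path` are checked; the file itself
    does not need to exist. (Hypothetical helper, not a TORQUE tool.)
    """
    parent = os.path.dirname(os.path.abspath(path))
    missing = None
    # Walk upwards until we reach a directory that exists.
    while not os.path.isdir(parent):
        missing = parent
        next_parent = os.path.dirname(parent)
        if next_parent == parent:  # reached the filesystem root
            break
        parent = next_parent
    return missing


if __name__ == "__main__":
    # Path copied from the syslog message above.
    print(first_missing_dir("/usr/spool/PBS/aux/941.auriga.qut.edu.au"))
```

On a node in the state described above I would expect this to report the
aux directory, and to report None once the directory has been created.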
Garrick Staples wrote:
>On Thu, Sep 08, 2005 at 12:39:15PM +1000, Ashley Wright alleged:
>>I have increased the loglevel to 3, and one of the messages I get is:
>>09/08/2005 12:33:50;0001; pbs_mom;Job;914.auriga.qut.edu.au;phase 2 of
>>job launch successfully completed
>>09/08/2005 12:33:50;0001; pbs_mom;Job;TMomFinalizeJob3;read start
>>return code=-1 session=127
>>09/08/2005 12:33:50;0001; pbs_mom;Job;TMomFinalizeJob3;job not
>>started, Failure job exec failure, before files staged, no retry
>>09/08/2005 12:33:50;0001; pbs_mom;Job;914.auriga.qut.edu.au;ALERT:
>>job failed phase 3 start, server will retry
>>09/08/2005 12:33:50;0008; pbs_mom;Req;send_sisters;sending ABORT to
>>What is 'phase 3'? It seems to say this is before the files are staged.
>phase 2 launches the child process that will eventually become the job.
>phase 3 is MOM reading a status code from the child telling it if the
>child was successful.
>The fact that the parent got a -1 means that the child caught an error
>and exited. Unfortunately it is really hard to debug problems in the
>child process, partly because it can't write to the mom log.
>Did you configure torque with --enable-syslog? If so, the child should
>syslog any errors.
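
To make the phase 2 / phase 3 description concrete, here is a rough
sketch of the general pattern: a parent forks a child, and the child
reports a status code back before it becomes the job. This is not
TORQUE's code. The pipe, the function name launch_and_read_status and the
use of 0 / -1 as codes are assumptions for illustration only.

```python
import os
import struct

STATUS_FORMAT = "i"  # one native int, assumed for this sketch
STATUS_SIZE = struct.calcsize(STATUS_FORMAT)


def launch_and_read_status(child_setup):
    """Fork a child, run child_setup() in it, return the code it reports."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # "phase 2": launch the child
    if pid == 0:
        # Child: it cannot write to the parent's log, so all it can do
        # is send a status code (a real daemon might also syslog here).
        code = -1
        try:
            os.close(read_fd)
            child_setup()
            code = 0
        except Exception:
            code = -1
        finally:
            os.write(write_fd, struct.pack(STATUS_FORMAT, code))
            os._exit(0 if code == 0 else 1)
    # Parent, "phase 3": read the status code from the child.
    os.close(write_fd)
    data = os.read(read_fd, STATUS_SIZE)
    os.close(read_fd)
    os.waitpid(pid, 0)
    if len(data) < STATUS_SIZE:
        return -1  # child died before reporting anything
    return struct.unpack(STATUS_FORMAT, data)[0]


if __name__ == "__main__":
    # A setup step that fails the same way as in this thread:
    # opening a file under a directory that does not exist.
    print(launch_and_read_status(lambda: open("/no/such/dir/jobfile", "w")))
```

The point of the sketch is only that the parent sees a bare -1 and
nothing else, which is why the child's own syslog output matters.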
>>A little further on it seems like the files are copied and the job is
>>09/08/2005 12:33:50;0100; pbs_mom;Req;;Type CopyFiles request received
>>from PBS_Server at mgt, sock=10
>>09/08/2005 12:33:50;0008; pbs_mom;Job;process_request;request type
>>CopyFiles from host mgt allowed
>>09/08/2005 12:33:50;0004; pbs_mom;Fil;914.auriga.qut.edu.au;forking to
>>user, uid: 1001 gid: 100 homedir: '/home/wright4'
>Hrm, I don't think that should be happening after the child has failed.
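
When comparing the order of events like this, it can help to split the
mom log records into fields. A minimal sketch follows, based only on the
lines quoted in this thread: six semicolon-separated fields, with the
wrapped lines joined back into one. The field names in MomLogRecord are
my own guesses, not names taken from the TORQUE documentation.

```python
from typing import NamedTuple


class MomLogRecord(NamedTuple):
    # Field names are assumptions made for this sketch.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_id: str
    message: str


def parse_mom_log_line(line):
    """Split one mom log line (already unwrapped) into its six fields."""
    # Limit the split so semicolons inside the message are preserved.
    parts = line.split(";", 5)
    if len(parts) != 6:
        raise ValueError("expected 6 ';'-separated fields: %r" % line)
    return MomLogRecord(*(part.strip() for part in parts))


if __name__ == "__main__":
    record = parse_mom_log_line(
        "09/08/2005 12:33:50;0001; pbs_mom;Job;TMomFinalizeJob3;"
        "read start return code=-1 session=127"
    )
    print(record.timestamp, record.obj_id, record.message)
```

Sorting or filtering on the timestamp and id fields makes it easier to
see which requests arrived after the job start had already failed.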
a2.wright at qut.edu.au
HPC and Research Support Group
Queensland University of Technology (QUT)