[torqueusers] PBS unable to execute Job.
Ashley Wright
a2.wright at qut.edu.au
Sun Sep 11 16:13:18 MDT 2005
Thanks Garrick,
I checked syslog and found the error message below:
Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in
TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au
So I have created the directory '/usr/spool/PBS/aux' and I can now
submit jobs.
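For anyone hitting the same error, here is a minimal Python sketch of the fix: create the directory named in the syslog message if it is missing. The helper name ensure_dir and the 0o755 mode are my own assumptions, not something TORQUE prescribes, so check the ownership and permissions your installation expects.

```python
import os

def ensure_dir(path, mode=0o755):
    """Create path (and any missing parents) if absent.

    Returns True if the directory had to be created, False if it
    already existed. The default mode is an assumption for illustration.
    """
    if os.path.isdir(path):
        return False
    os.makedirs(path, mode=mode, exist_ok=True)
    return True
```

On the compute node this would be run as root with the path from the error message, i.e. ensure_dir("/usr/spool/PBS/aux").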
I did not think to look in syslog, as I thought all the errors
were logged to the mom logs.
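A rough sketch of how one might pull the pbs_mom entries out of syslog text. The function name mom_syslog_lines is hypothetical, and since the location of the syslog file varies between systems, the sketch takes an iterable of lines rather than assuming a path. The second sample line is made up purely to show the filtering.

```python
def mom_syslog_lines(lines, tag="pbs_mom:"):
    """Return the syslog lines that carry the pbs_mom tag, without newlines."""
    return [line.rstrip("\n") for line in lines if tag in line]

sample = [
    # Real message from this thread, joined onto one line.
    "Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in "
    "TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au\n",
    # Made-up unrelated line, for illustration only.
    "Sep 12 08:05:22 node010 other_daemon: unrelated message\n",
]

for entry in mom_syslog_lines(sample):
    print(entry)
```

In practice the lines would come from an open file handle on whichever file syslog writes to on the node.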
Thanks,
Ashley
Garrick Staples wrote:
>On Thu, Sep 08, 2005 at 12:39:15PM +1000, Ashley Wright alleged:
>
>
>>Thanks Chris,
>>
>>I have increased the loglevel to 3, and one of the messages I get is:
>>
>>09/08/2005 12:33:50;0001; pbs_mom;Job;914.auriga.qut.edu.au;phase 2 of
>>job launch successfully completed
>>09/08/2005 12:33:50;0001; pbs_mom;Job;TMomFinalizeJob3;read start
>>return code=-1 session=127
>>09/08/2005 12:33:50;0001; pbs_mom;Job;TMomFinalizeJob3;job not
>>started, Failure job exec failure, before files staged, no retry
>>09/08/2005 12:33:50;0001; pbs_mom;Job;914.auriga.qut.edu.au;ALERT:
>>job failed phase 3 start, server will retry
>>09/08/2005 12:33:50;0008; pbs_mom;Req;send_sisters;sending ABORT to
>>sisters
>>
>>What is 'phase 3'? It seems to say this is before the files are staged.
>>
>>
>phase 2 launches the child process that will eventually become the job.
>phase 3 is MOM reading a status code from the child telling it if the
>child was successful.
>
>The fact that the parent got a -1 means that the child caught an error
>and exited. Unfortunately it is really hard to debug problems in the
>child process, partly because it can't write to the mom log.
>
>Did you configure torque with --enable-syslog? If so, the child should
>syslog any errors.
>
>
>
>
>>A little further on, it seems like the files are copied and the job is
>>forked:
>>
>>09/08/2005 12:33:50;0100; pbs_mom;Req;;Type CopyFiles request received
>>from PBS_Server at mgt, sock=10
>>09/08/2005 12:33:50;0008; pbs_mom;Job;process_request;request type
>>CopyFiles from host mgt allowed
>>09/08/2005 12:33:50;0004; pbs_mom;Fil;914.auriga.qut.edu.au;forking to
>>user, uid: 1001 gid: 100 homedir: '/home/wright4'
>>
>>
>
>Hrm, I don't think that should be happening after the child has failed.
>
>
>
--
Ashley Wright
3864 9264
a2.wright at qut.edu.au
HPC and Research Support Group
Queensland University of Technology (QUT)