[torqueusers] PBS unable to execute Job.
Ashley Wright
a2.wright at qut.edu.au
Sun Sep 11 18:41:45 MDT 2005
Thanks Garrick,
The error was easy to fix; it was just hard to find the right log file.
Would it be worthwhile to put a list of log files in the
troubleshooting section of the online manual?
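As an illustration of where to look, here is a minimal Python sketch that pulls the pbs_mom entries out of a syslog file. It assumes the classic "Mon DD HH:MM:SS host program: message" layout of the syslog line quoted later in this thread. The function name `daemon_messages` and the log path in the usage comment are hypothetical; the actual syslog file location depends on the system's syslog configuration.

```python
import re
from typing import Dict, Iterable, Iterator

# Matches the classic syslog layout, for example:
#   "Sep 12 08:05:21 node010 pbs_mom: <message>"
# An optional "[pid]" after the program name is tolerated.
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\s:\[]+)(?:\[\d+\])?: "
    r"(?P<message>.*)$"
)


def daemon_messages(lines: Iterable[str],
                    program: str = "pbs_mom") -> Iterator[Dict[str, str]]:
    """Yield the parsed syslog entries written by the given program."""
    for line in lines:
        match = SYSLOG_LINE.match(line.rstrip("\n"))
        if match and match.group("program") == program:
            yield match.groupdict()


# Hypothetical usage (the path is an assumption, adjust for your system):
#   with open("/var/log/messages", errors="replace") as log:
#       for entry in daemon_messages(log):
#           print(entry["timestamp"], entry["host"], entry["message"])
```

Lines that do not match the layout, or that come from other programs, are skipped rather than reported, so the output is limited to what the MOM and its child processes logged.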
The way we have set up our cluster is probably a little different from
the recommended way. We have installed torque into /usr/local,
which is an NFS share; however, /usr in general is not shared. So we
copied over the directories and configuration files needed to start.
I am not sure how this directory was left out, as it was working early
last week.
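A check along these lines could catch a missing directory before any job is submitted. This is only a sketch: the helper name `missing_spool_dirs` is hypothetical, and the default list contains only `aux` because that is the one directory confirmed missing in this thread. Any other subdirectories a given installation needs would have to be added by whoever runs it.

```python
import os
from typing import Iterable, List


def missing_spool_dirs(spool_root: str,
                       required: Iterable[str] = ("aux",)) -> List[str]:
    """Return the full paths of required subdirectories that do not exist."""
    missing = []
    for name in required:
        path = os.path.join(spool_root, name)
        # isdir() is False both when the path is absent and when it is
        # something other than a directory (for example a plain file).
        if not os.path.isdir(path):
            missing.append(path)
    return missing


# Hypothetical usage on a compute node, with the spool path from this thread:
#   for path in missing_spool_dirs("/usr/spool/PBS"):
#       print("missing directory:", path)
```

Running it on each node after copying the local directories over would list anything that was left out.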
Thanks again,
Ashley
Garrick Staples wrote:
>Dave, we should have a "startup sanity check" to warn about these kinds of
>conditions when MOM first launches.
>
>
>Ashley, sorry that was so hard to find, but I'm glad it's working for
>you.
>
>
>On Mon, Sep 12, 2005 at 08:13:18AM +1000, Ashley Wright alleged:
>
>
>>Thanks Garrick,
>>
>>I checked syslog and found the error message below:
>>Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in
>>TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au
>>
>>So I have created the directory '/usr/spool/PBS/aux' and I can now
>>submit jobs.
>>I did not think to look in the syslog logs, as I thought all the errors
>>were logged to the mom logs.
>>
>>Thanks,
>>Ashley
>>
>>Garrick Staples wrote:
>>
>>
>>
>>>On Thu, Sep 08, 2005 at 12:39:15PM +1000, Ashley Wright alleged:
>>>
>>>>Thanks Chris,
>>>>
>>>>I have increased the loglevel to 3. Some of the messages I get are:
>>>>
>>>>09/08/2005 12:33:50;0001; pbs_mom;Job;914.auriga.qut.edu.au;phase 2 of
>>>>job launch successfully completed
>>>>09/08/2005 12:33:50;0001; pbs_mom;Job;TMomFinalizeJob3;read start
>>>>return code=-1 session=127
>>>>09/08/2005 12:33:50;0001; pbs_mom;Job;TMomFinalizeJob3;job not
>>>>started, Failure job exec failure, before files staged, no retry
>>>>09/08/2005 12:33:50;0001; pbs_mom;Job;914.auriga.qut.edu.au;ALERT:
>>>>job failed phase 3 start, server will retry
>>>>09/08/2005 12:33:50;0008; pbs_mom;Req;send_sisters;sending ABORT to
>>>>sisters
>>>>
>>>>What is 'phase 3'? It seems to say this is before the files are staged.
>>>>
>>>phase 2 launches the child process that will eventually become the job.
>>>phase 3 is MOM reading a status code from the child telling it whether the
>>>child was successful.
>>>
>>>The fact that the parent got a -1 means that the child caught an error
>>>and exited. Unfortunately it is really hard to debug problems in the
>>>child process, partly because it can't write to the mom log.
>>>
>>>Did you configure torque with --enable-syslog? If so, the child should
>>>syslog any errors.
>>>
>>>>A little further on, it seems like the files are copied and the job is
>>>>forked:
>>>>
>>>>09/08/2005 12:33:50;0100; pbs_mom;Req;;Type CopyFiles request received
>>>>from PBS_Server at mgt, sock=10
>>>>09/08/2005 12:33:50;0008; pbs_mom;Job;process_request;request type
>>>>CopyFiles from host mgt allowed
>>>>09/08/2005 12:33:50;0004; pbs_mom;Fil;914.auriga.qut.edu.au;forking to
>>>>user, uid: 1001 gid: 100 homedir: '/home/wright4'
>>>>
>>>Hrm, I don't think that should be happening after the child has failed.
>>>
>>>
>>>
>>>------------------------------------------------------------------------
>>>
>>>_______________________________________________
>>>torqueusers mailing list
>>>torqueusers at supercluster.org
>>>http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>>>
>>>
>>--
>>Ashley Wright
>>3864 9264
>>a2.wright at qut.edu.au
>>HPC and Research Support Group
>>Queensland University of Technology (QUT)
--
Ashley Wright
3864 9264
a2.wright at qut.edu.au
HPC and Research Support Group
Queensland University of Technology (QUT)