[torqueusers] PBS unable to execute Job.

Ashley Wright a2.wright at qut.edu.au
Sun Sep 11 18:41:45 MDT 2005


Thanks Garrick,

The error was easy to fix; it was just hard to find the right log file.
Would it be worthwhile putting a list of log files in the 
troubleshooting section of the online manual?
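
As a rough sketch of how the relevant syslog entry could be picked out, the snippet below filters syslog-style lines for messages tagged pbs_mom. It assumes the traditional "Mon DD HH:MM:SS host tag: message" layout seen in the quoted error further down. The function name is hypothetical, and the location of the syslog file varies by system and is not stated in this thread, so the example works on lines held in memory.

```python
import re

# Traditional syslog layout: "Sep 12 08:05:21 node010 pbs_mom: message".
# Assumption: the daemon tag may optionally be followed by "[pid]".
SYSLOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<tag>[^\s:\[]+)(?:\[\d+\])?: "
    r"(?P<message>.*)$"
)


def find_daemon_messages(lines, tag="pbs_mom"):
    """Return (timestamp, host, message) tuples for lines logged by `tag`."""
    hits = []
    for line in lines:
        match = SYSLOG_LINE.match(line.rstrip("\n"))
        if match and match.group("tag") == tag:
            hits.append(
                (match.group("stamp"), match.group("host"), match.group("message"))
            )
    return hits


if __name__ == "__main__":
    sample = [
        # The error reported in this thread, joined onto one line.
        "Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in "
        "TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au",
        # Made-up filler line, only here to show that other daemons are skipped.
        "Sep 12 08:05:22 node010 sshd[2211]: session opened",
    ]
    for stamp, host, message in find_daemon_messages(sample):
        print(stamp, host, message)
```

On a real node the same function could be fed the lines of whichever file syslog writes to there.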

The way we have set up our cluster is probably a little different from 
the recommended way. We installed torque into /usr/local, which is an NFS 
share; /usr in general is not shared, so we copied over the directories 
and configuration files needed to start. I am not sure how this directory 
(/usr/spool/PBS/aux) was left out, as everything was working early last 
week.
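
A check along the following lines could catch a missing directory on a node before jobs are submitted. It is only a sketch: the helper name is hypothetical, "aux" is the one directory known from this thread to be required, and "mom_priv" is an assumed extra entry used for illustration that should be checked against the actual installation.

```python
import os
import tempfile


def missing_dirs(pbs_home, subdirs):
    """Return the entries of `subdirs` that are not existing directories under pbs_home."""
    return [
        name for name in subdirs
        if not os.path.isdir(os.path.join(pbs_home, name))
    ]


if __name__ == "__main__":
    # Run against a throwaway tree so nothing on a real node is touched.
    with tempfile.TemporaryDirectory() as home:
        os.makedirs(os.path.join(home, "mom_priv"))
        # "aux" comes from this thread; "mom_priv" is an assumed example entry.
        print(missing_dirs(home, ["aux", "mom_priv"]))
```

On our nodes the first argument would be /usr/spool/PBS, the path that appears in the syslog error quoted below.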

Thanks again,
Ashley

Garrick Staples wrote:

>Dave, we should have a "startup sanity check" to warn for these types of
>conditions when MOM first launches.
>
>
>Ashley, sorry that was so hard to find, but I'm glad it's working for
>you.
>
>
>On Mon, Sep 12, 2005 at 08:13:18AM +1000, Ashley Wright alleged:
>
>>Thanks Garrick,
>>
>>I checked syslog and found the error message below:
>>Sep 12 08:05:21 node010 pbs_mom: No such file or directory (2) in 
>>TMomFinalizeChild, cannot open /usr/spool/PBS/aux/941.auriga.qut.edu.au
>>
>>So I have created the directory '/usr/spool/PBS/aux' and I can now 
>>submit jobs.
>>I did not think to look in the syslog logs, as I thought all the errors 
>>were logged to the mom logs.
>>
>>Thanks,
>>Ashley
>>
>>Garrick Staples wrote:
>>
>>>On Thu, Sep 08, 2005 at 12:39:15PM +1000, Ashley Wright alleged:
>>>
>>>>Thanks Chris,
>>>>
>>>>I have increased the loglevel to 3. One of the messages I get is:
>>>>
>>>>09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;phase 2 of 
>>>>job launch successfully completed
>>>>09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;read start 
>>>>return code=-1 session=127
>>>>09/08/2005 12:33:50;0001;   pbs_mom;Job;TMomFinalizeJob3;job not 
>>>>started, Failure job exec failure, before files staged, no retry
>>>>09/08/2005 12:33:50;0001;   pbs_mom;Job;914.auriga.qut.edu.au;ALERT:  
>>>>job failed phase 3 start, server will retry
>>>>09/08/2005 12:33:50;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>>>>sisters
>>>>
>>>>What is 'phase 3'?  It seems to say this is before the files are staged.
>>>
>>>phase 2 launches the child process that will eventually become the job.
>>>phase 3 is MOM reading a status code from the child telling it if the
>>>child was successful.
>>>
>>>The fact that the parent got a -1 means that the child caught an error
>>>and exited.  Unfortunately it is really hard to debug problems in the
>>>child process, partly because it can't write to the mom log.
>>>
>>>Did you configure torque with --enable-syslog?  If so, the child should
>>>syslog any errors.
>>>
>>>
>>>>A little further on it seems like the files are copied and the job is 
>>>>forked:
>>>>
>>>>09/08/2005 12:33:50;0100;   pbs_mom;Req;;Type CopyFiles request received 
>>>>from PBS_Server at mgt, sock=10
>>>>09/08/2005 12:33:50;0008;   pbs_mom;Job;process_request;request type 
>>>>CopyFiles from host mgt allowed
>>>>09/08/2005 12:33:50;0004;   pbs_mom;Fil;914.auriga.qut.edu.au;forking to 
>>>>user, uid: 1001  gid: 100  homedir: '/home/wright4'
>>>
>>>Hrm, I don't think that should be happening after the child has failed.
>>>
>>>
>>>
>>>------------------------------------------------------------------------
>>>
>>>_______________________________________________
>>>torqueusers mailing list
>>>torqueusers at supercluster.org
>>>http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>>>      
>>>
>


-- 
Ashley Wright
3864 9264
a2.wright at qut.edu.au
HPC and Research Support Group
Queensland University of Technology (QUT)



