[torqueusers] death of pbs_mom's

Garrick Staples garrick at clusterresources.com
Wed Oct 11 13:51:30 MDT 2006


On Wed, Oct 11, 2006 at 02:08:17PM -0400, Brock Palen alleged:
> On Oct 11, 2006, at 12:29 PM, Jerry Smith wrote:
> 
> >Brock,
> >
> >
> >>
> >>We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
> >>In the last few days we started seeing moms dying on the nodes.
> >>While they are not dying en masse, there has been no noticeable
> >>correlation between the nodes that have had them die.
> >>
> >>The logs say nothing.  This causes jobs to hang in the queue in the
> >>Running state, and they can only be removed by purging the job.  These
> >>hung jobs go on to cause Moab to get angry and never stop trying to
> >>remove the now very old job, so it never sleeps.
> >>
> >>Has anyone else ever seen a problem when a mom dies?
> >>
> >
> >
> >We had a similar problem a while back, and actually found a
> >segfault (this was in the 2.0 series).
> >
> >Do you have
> >PBSCOREDUMP=1
> >set in your $PBS_HOME/environment?
> I have not, but will look into doing that.
> 
> >
> >This causes the mom to dump a core that you can look into, which gave
> >us the information to pass along to the torque-devs and resulted in a fix.
> >
> >
> >How are you "purging" the job?
> qdel -p JOBID
> This deletes a job that says a cancel is already in progress.
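
For reference, the PBSCOREDUMP=1 setting Jerry mentions could be added with
something like the sketch below.  The function name and the fallback path
/var/spool/torque are assumptions for illustration, not anything from this
thread; use your site's real $PBS_HOME.  pbs_mom presumably has to be
restarted before the setting takes effect.

```shell
#!/bin/sh
# Hypothetical helper: add PBSCOREDUMP=1 to a TORQUE environment file
# unless a PBSCOREDUMP line is already there.
# The fallback path /var/spool/torque is an assumption.
enable_mom_coredump() {
    env_file="${1:-${PBS_HOME:-/var/spool/torque}/environment}"

    # Leave the file alone if the variable is already set to something.
    if grep -q '^PBSCOREDUMP=' "$env_file" 2>/dev/null; then
        return 0
    fi

    # If the file is non-empty and does not end in a newline, add one
    # first so the new setting lands on its own line.
    if [ -s "$env_file" ] && [ -n "$(tail -c 1 "$env_file")" ]; then
        echo >> "$env_file"
    fi

    printf 'PBSCOREDUMP=1\n' >> "$env_file"
}
```

Run as root on a node, "enable_mom_coredump" with no argument edits
$PBS_HOME/environment; passing a path edits that file instead.  Running it
twice does not add a second line.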

The -p option was only added as an ugly hack for the rare case of a node
going away and never coming back.  Don't use it routinely because it
basically shoots TORQUE in the head.

If pbs_mom crashes, just start it again and let the job exit normally.
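
A rough sketch of that on a node, for illustration only.  The function name
and the MOM_START_CMD variable are made up here; set the variable to however
your site normally starts the mom (an init script, for example).  The
process-name argument exists only so the sketch can be tried without touching
a real pbs_mom.

```shell
#!/bin/sh
# Sketch: start pbs_mom again only if no such process is running.
# MOM_START_CMD is a hypothetical variable; by default this just runs
# "pbs_mom" from PATH, which is an assumption about your install.
restart_mom_if_dead() {
    proc_name="${1:-pbs_mom}"
    start_cmd="${MOM_START_CMD:-pbs_mom}"

    # pgrep -x matches the process name exactly.
    if pgrep -x "$proc_name" >/dev/null 2>&1; then
        echo "$proc_name is running"
        return 0
    fi

    echo "$proc_name is not running; starting it"
    # Deliberately unquoted so a command with arguments is split into words.
    $start_cmd
}
```

This only starts the mom again.  It does not touch the job, so there is no
qdel -p involved.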


