[torqueusers] death of pbs_mom's
garrick at clusterresources.com
Wed Oct 11 13:51:30 MDT 2006
On Wed, Oct 11, 2006 at 02:08:17PM -0400, Brock Palen alleged:
> On Oct 11, 2006, at 12:29 PM, Jerry Smith wrote:
> >>We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
> >>In the last few days we started seeing moms dying on the nodes.
> >>While they are not dying en masse, there has been no noticeable
> >>correlation between the nodes that have had them die.
> >>The logs say nothing. This causes jobs to hang in the queue in the
> >>Running state, and they can only be removed by purging the job. These
> >>hung jobs then cause Moab to keep trying to remove the now very old
> >>job, so it never sleeps.
> >>Has anyone else ever seen a problem when a mom dies?
> >We had a similar problem a while back, and actually found a segfault
> >(this was in the 2.0 series).
> >Do you have :
> >Set in your $PBS_HOME/environment
> I have not but will look into doing that.
> >This causes the mom to dump a core that you can look into, which gave us
> >the information to pass along to the torque-devs and resulted in a fix.
> >How are you "purging" the job?
> qdel -p JOBID This deletes a job that says a cancel is already in
The -p option was only added as an ugly hack for the rare case of a node
going away and never coming back. Don't use it routinely because it basically
shoots TORQUE in the head.
If pbs_mom crashes, just start it again and let the job exit normally.
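
A minimal sketch of "just start it again", for anyone who wants to script the check on a compute node. The helper name ensure_running is hypothetical and not part of TORQUE; it assumes pgrep is installed and that pbs_mom is on root's PATH.

```shell
#!/bin/sh
# Sketch: start a daemon only if no process with that exact name exists.
# ensure_running is a hypothetical helper, not a TORQUE command.
ensure_running() {
    name=$1
    shift
    # pgrep -x matches the process name exactly
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "$name already running"
        return 0
    fi
    echo "$name not running, starting it"
    # Run the start command given as the remaining arguments
    "$@"
}

# Example (as root on the node; assumes pbs_mom is on PATH):
#   ensure_running pbs_mom pbs_mom
```

How a restarted pbs_mom treats jobs that were running when it died depends on its start-up options, so check the pbs_mom man page for your version before automating this.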