[torqueusers] death of pbs_mom's

Jerry Smith jdsmit at sandia.gov
Wed Oct 11 10:29:49 MDT 2006


Brock,


> 
> We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
> In the last few days we started seeing moms dying on the nodes.
> While they are not dying en masse, there has been no noticeable
> correlation between the nodes that have had them die.
> 
> The logs say nothing.  This causes jobs to hang in the queue in the
> Running state, and they can only be removed by purging the job.  These
> hung jobs go on to make Moab angry: it keeps trying to remove the
> now very old job and so never sleeps.
> 
> Has anyone else ever seen a problem when a mom dies?
> 


We had a similar problem a while back, and actually found a segfault (this
was in the 2.0 series).

Do you have:
PBSCOREDUMP=1
set in your $PBS_HOME/environment?

This causes the mom to dump a core that you can look into. In our case the
core gave us the information to pass along to the torque-devs, which
resulted in a fix.
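
If you want to check that setting across many nodes, something like the
following might help. It is only a rough sketch: the function name
pbscoredump_value is made up for illustration, and it assumes the
environment file is plain NAME=VALUE lines with "#" for comments. It only
reports what the file contains; it does not tell you whether the running
mom has picked the setting up.

```python
def pbscoredump_value(env_text):
    """Return the value assigned to PBSCOREDUMP in env_text, or None.

    Assumption: the file is simple NAME=VALUE lines; blank lines and
    lines starting with '#' are ignored. If the name appears more than
    once, the last assignment is returned.
    """
    found = None
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == "PBSCOREDUMP":
            found = value.strip()
    return found
```

On a node you would read $PBS_HOME/environment (the actual location of
PBS_HOME depends on how torque was built) and pass its contents to the
function; a result of None means the variable is not set in that file.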


How are you "purging" the job?



Jerry



