[torqueusers] death of pbs_mom's
Jerry Smith
jdsmit at sandia.gov
Wed Oct 11 10:29:49 MDT 2006
Brock,
>
> We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
> In the last few days we started seeing moms dying on the nodes.
> While they are not dying en masse, there has been no noticeable
> correlation between the nodes that have had them die.
>
> The logs say nothing. This causes jobs to hang in the queue in the
> Running state, and they can only be removed by purging the job. These
> hung jobs go on to cause Moab to get angry and never stop trying to
> remove the now very old job, so it never sleeps.
>
> Has anyone else ever seen a problem when a mom dies?
>
We had a similar problem a while back, and actually found a segfault (this
was in the 2.0 series).
Do you have:
PBSCOREDUMP=1
set in your $PBS_HOME/environment?
This causes the mom to dump a core that you can look into. That gave us the
information to pass along to the torque-devs, which resulted in a fix.
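If it helps, here is a rough Python sketch for checking a node's environment file for that setting. It is only an illustration: the function name coredump_enabled is made up, the fallback path /var/spool/torque is a guess, and the handling of "#" comment lines is an assumption about the file format. It only tests for the literal PBSCOREDUMP=1 line mentioned above.

```python
import os
from pathlib import Path


def coredump_enabled(env_file):
    """Return True if env_file has an active PBSCOREDUMP=1 line.

    Hypothetical helper: it treats the file as simple KEY=VALUE lines
    and assumes lines starting with '#' are comments.
    """
    path = Path(env_file)
    if not path.is_file():
        return False
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines and (assumed) comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "PBSCOREDUMP":
            # The first PBSCOREDUMP line decides the result.
            return value.strip() == "1"
    return False


if __name__ == "__main__":
    # Assumption: PBS_HOME is exported; the fallback path is only a guess.
    pbs_home = os.environ.get("PBS_HOME", "/var/spool/torque")
    print(coredump_enabled(Path(pbs_home) / "environment"))
```

Run on a compute node, it prints True or False depending on whether the line is present. Whether the mom needs a restart to pick up a change is not something this checks.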
How are you "purging" the job?
Jerry