[torqueusers] pbs_mom keeps going down, how to diagnose
jbernstein at penguincomputing.com
Tue Dec 23 11:49:37 MST 2008
Garrick Staples wrote:
> On Mon, Dec 22, 2008 at 11:30:40PM -0500, Damian Fermin alleged:
>> I'm a bit of a torque newbie here so any advice will be appreciated.
>> I'm running torque/maui on a 32-core 4-node cluster with one 4-core
>> head node.
>> Lately I've noticed that my users submit jobs and eventually the jobs
>> don't finish.
>> qstat says they are running, but pbsnodes -a reports their
>> node status as either "down" or "down, job-exclusive".
>> Running cexec -nnode_name "/sbin/service pbs_mom status" reports that
>> pbs_mom is not running.
>> I can restart the process. I just want to know why/how it's being stopped.
>> There doesn't seem to be any rhyme or reason for when the pbs_mom
>> goes down.
>> How do I go about diagnosing this problem?
>> Again any and all advice is welcome.
>> Torque version: 2.1.8
>> Maui version: 3.2.6p20
>> I'm running Linux (RHEL 5.0) on an HP XPC cluster.
>> Let me know if there is any other information that may be helpful to
>> diagnose this issue.
>> Thanks in advance for any and all advice.
> Look at the mom log; that will tell you if the daemon is being shut down. Check
> dmesg or syslog; that will tell you if the daemon is segfaulting.
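For the mom log, something like the following may save a few keystrokes. It is only a sketch: the function name last_mom_log_lines is made up for illustration, and the default directory assumes TORQUE was built with its usual home of /var/spool/torque, which may not match your install. pbs_mom names its log files by date (YYYYMMDD), so the last name in a sorted listing is the newest.

```shell
# Print the last N lines (default 20) of the newest file in a mom_logs directory.
# Usage: last_mom_log_lines [LOGDIR] [N]
# The default LOGDIR is an assumption; adjust it to your TORQUE home.
last_mom_log_lines() {
    logdir=${1:-/var/spool/torque/mom_logs}
    n=${2:-20}
    # Log files are named YYYYMMDD, so sorting by name puts the newest last.
    newest=$(ls "$logdir" | sort | tail -n 1)
    if [ -n "$newest" ]; then
        tail -n "$n" "$logdir/$newest"
    fi
}
```

Run it on the node right after you notice pbs_mom is gone, e.g. `last_mom_log_lines /var/spool/torque/mom_logs 50`, and see what the daemon logged last.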
You can easily search through the syslog for any occurrence of a segfault:
# grep -i segfault /var/log/messages
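If other programs are also crashing on the node, a small wrapper can count only the segfault lines that mention pbs_mom. Again just a sketch: count_segfaults is a hypothetical helper name, and it assumes kernel messages land in a syslog-style text file such as /var/log/messages.

```shell
# Count lines in a syslog-style file that mention both "segfault" and a
# given program name (default: pbs_mom).
# Usage: count_segfaults LOGFILE [NAME]
count_segfaults() {
    logfile=$1
    name=${2:-pbs_mom}
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -i segfault "$logfile" | grep -c "$name" || true
}
```

Running `count_segfaults /var/log/messages` on each compute node gives a quick idea of whether, and how often, pbs_mom has been segfaulting there.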
I'd be curious to see if you are experiencing a segfault with pbs_mom
like the one I'm currently trying to diagnose myself.