[torqueusers] pbs_mom keeps going down, how to diagnose

Joshua Bernstein jbernstein at penguincomputing.com
Tue Dec 23 11:49:37 MST 2008



Garrick Staples wrote:
> On Mon, Dec 22, 2008 at 11:30:40PM -0500, Damian Fermin alleged:
>> Hello.
>>
>> I'm a bit of a torque newbie here so any advice will be appreciated.
>> I'm running torque/maui on a 32-core 4-node cluster with one 4-core  
>> head node.
>>
>> Lately I've noticed that my users submit jobs and eventually the jobs  
>> don't finish.
>> qstat says they are running, but pbsnodes -a reports their
>> node status as either "down" or "down, job-exclusive".
>> Running cexec -nnode_name "/sbin/service pbs_mom status" reports that  
>> pbs_mom is not running.
>>
>> I can restart the process. I just want to know why/how it's being
>> shut down.
>> There doesn't seem to be any rhyme or reason for when the pbs_mom  
>> goes down.
>>
>> How do I  go about diagnosing this problem?
>>
>> Again any and all advice is welcome.
>> Torque version: 2.1.8
>> Maui version: 3.2.6p20
>>
>> I'm running Linux (RHEL 5.0) on an HP XPC cluster.
>>
>> Let me know if there is any other information that may be helpful to  
>> diagnose this issue.
>> Thanks in advance for any and all advice.
> 
> Look at the mom log; that will tell you if the daemon is being shut down. Check
> dmesg or syslog; that will tell you if the daemon is segfaulting.

You can easily search the syslog for any occurrence of a segfault
using:

# grep -i segfault /var/log/messages
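
If the log is busy, it may help to narrow the search to pbs_mom only. Here is a rough sketch of how that could look; find_mom_segfaults is a made-up helper name, and the default path /var/log/messages is an assumption (the usual RHEL location), so adjust it for your syslog setup:

```shell
# Hypothetical helper: print syslog lines that report a segfault and mention pbs_mom.
# With no arguments it reads /var/log/messages (assumed default); otherwise it
# reads the files given on the command line.
find_mom_segfaults() {
    # -h drops file names from the output, -i ignores case;
    # the second grep keeps only lines that name pbs_mom.
    grep -hi 'segfault' "${@:-/var/log/messages}" 2>/dev/null | grep 'pbs_mom'
}

# Usage (as root on the compute node):
#   find_mom_segfaults
#   find_mom_segfaults /var/log/messages /var/log/messages.1
```

The rotated file name messages.1 in the usage line is also an assumption about how logrotate is configured on your nodes. The timestamps on any matching lines should tell you roughly when the daemon went away.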

I'd be curious to see if you are experiencing a segfault with pbs_mom
like the one I'm currently trying to diagnose myself.

-Joshua Bernstein
Software Engineer
Penguin Computing

