[torqueusers] pbs_mom keeps going down, how to diagnose

Garrick Staples garrick at usc.edu
Mon Dec 22 21:59:28 MST 2008

On Mon, Dec 22, 2008 at 11:30:40PM -0500, Damian Fermin alleged:
> Hello.
> I'm a bit of a torque newbie here so any advice will be appreciated.
> I'm running torque/maui on a 32-core 4-node cluster with one 4-core  
> head node.
> Lately I've noticed that jobs my users submit start running but
> never finish.
> qstat says they are running, but pbsnodes -a reports the status of
> their nodes as either "down" or "down, job-exclusive".
> Running cexec -nnode_name "/sbin/service pbs_mom status" reports that  
> pbs_mom is not running.
> I can restart the process. I just want to know why/how it's being
> shut down.
> There doesn't seem to be any rhyme or reason for when the pbs_mom  
> goes down.
> How do I go about diagnosing this problem?
> Again any and all advice is welcome.
> Torque version: 2.1.8
> Maui version: 3.2.6p20
> I'm running Linux (RHEL 5.0) on an HP XPC cluster.
> Let me know if there is any other information that may be helpful to  
> diagnose this issue.
> Thanks in advance for any and all advice.

Look at the mom log; it will tell you whether the daemon is being shut
down.  Check dmesg or syslog; they will tell you whether the daemon is
segfaulting.
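
As a rough illustration of the syslog check, here is a minimal Python
sketch that scans syslog files for kernel segfault reports naming
pbs_mom. The file paths and the exact message format are assumptions
(typical of Linux systems, not taken from this thread), and the helper
names find_segfaults and scan_file are made up for this example. The
mom log itself normally lives under the TORQUE home directory (often
/var/spool/torque/mom_logs/, one file per day), but that location
depends on how TORQUE was built.

```python
import re
from pathlib import Path

# Assumed syslog locations; adjust for your distribution.
SYSLOG_PATHS = [Path("/var/log/messages"), Path("/var/log/syslog")]

# Assumed kernel message shape: "pbs_mom[1234]: segfault at ..."
SEGFAULT_RE = re.compile(r"pbs_mom\[\d+\].*segfault", re.IGNORECASE)


def find_segfaults(lines):
    """Return the lines that report a pbs_mom segfault."""
    return [line.rstrip("\n") for line in lines if SEGFAULT_RE.search(line)]


def scan_file(path):
    """Scan one log file; return [] if it is missing or unreadable."""
    try:
        with open(path, errors="replace") as fh:
            return find_segfaults(fh)
    except OSError:
        return []


if __name__ == "__main__":
    for path in SYSLOG_PATHS:
        for hit in scan_file(path):
            print(f"{path}: {hit}")
```

Run it on the compute node whose pbs_mom went down (reading syslog
usually requires root). No output means no matching segfault lines were
found in those files, which points back to the mom log for evidence of
a normal shutdown.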

Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/
