[torqueusers] pbs_mom keeps going down, how to diagnose

Damian Fermin dfermin at umich.edu
Mon Dec 22 21:30:40 MST 2008


I'm a bit of a torque newbie here so any advice will be appreciated.
I'm running torque/maui on a 32-core 4-node cluster with one 4-core  
head node.

Lately I've noticed that jobs my users submit eventually stall and
never finish.
qstat says they are running, but pbsnodes -a reports the status of
their nodes as either "down" or "down, job-exclusive".
Running cexec -nnode_name "/sbin/service pbs_mom status" reports that
pbs_mom is not running.
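
The check described above (looking through pbsnodes -a for nodes marked "down") can be automated. What follows is only a minimal sketch in Python, not anything shipped with Torque. The function name down_nodes and the SAMPLE text are hypothetical, and the sketch assumes the usual pbsnodes -a layout: a node name at the start of a line, followed by indented "key = value" lines.

```python
import re


def down_nodes(pbsnodes_output):
    """Return {node_name: [states]} for every node whose state list contains 'down'.

    Assumes the usual `pbsnodes -a` layout: an unindented node name,
    followed by indented `key = value` attribute lines.
    """
    result = {}
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            # A blank line ends the current node's attribute block.
            current = None
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            continue
        # Matches "state = ..." but not "status = ...".
        match = re.match(r"\s+state\s*=\s*(.+)", line)
        if match and current:
            states = [s.strip() for s in match.group(1).split(",")]
            if "down" in states:
                result[current] = states
    return result


# Hypothetical sample text, for illustration only (not real cluster output).
SAMPLE = """\
node01
     state = free
     np = 8

node02
     state = down,job-exclusive
     np = 8
"""

if __name__ == "__main__":
    for name, states in down_nodes(SAMPLE).items():
        print(name, ",".join(states))
```

On a real cluster the text could come from something like subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True).stdout, run periodically so that the time a node first shows up as down can be compared against the pbs_mom logs on that node.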

I can restart the process. I just want to know why/how it's being
stopped. There doesn't seem to be any rhyme or reason for when the
pbs_mom goes down.

How do I go about diagnosing this problem?

Again any and all advice is welcome.
Torque version: 2.1.8
Maui version: 3.2.6p20

I'm running Linux (RHEL 5.0) on an HP XPC cluster.

Let me know if there is any other information that may be helpful to  
diagnose this issue.
Thanks in advance for any and all advice.



Damian Fermin, Ph.D
dfermin at umich.edu
Pathology Department
University of Michigan
1300 Catherine St.
Ann Arbor, MI 48109

                   "There is no gene for the Human Spirit"
                                             -- GATTACA
