[torqueusers] pbs_mom keeps going down, how to diagnose
Damian Fermin
dfermin at umich.edu
Mon Dec 22 21:30:40 MST 2008
Hello.
I'm a bit of a torque newbie here so any advice will be appreciated.
I'm running torque/maui on a 32-core 4-node cluster with one 4-core
head node.
Lately I've noticed that jobs my users submit eventually stall and
never finish.
qstat says they are running, but pbsnodes -a reports the state of
their nodes as either "down" or "down, job-exclusive".
Running cexec -nnode_name "/sbin/service pbs_mom status" reports that
pbs_mom is not running.
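
The first check above can be scripted. This is only a minimal sketch: it assumes pbsnodes -a prints each node name at the start of a line, followed by indented "state = ..." attribute lines. The function name list_down_nodes is made up for illustration.

```shell
#!/bin/sh
# list_down_nodes (hypothetical helper): read `pbsnodes -a` output on stdin
# and print the name of every node whose state includes "down".
list_down_nodes() {
    awk '
        # A line that starts in column one is taken to be a node name.
        /^[^[:space:]]/ { node = $1 }
        # An indented "state = ..." line belongs to the most recent node.
        /^[[:space:]]+state = / && /down/ { print node }
    '
}
```

It would be run on the head node as: pbsnodes -a | list_down_nodes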
I can restart the process, but I want to know why and how it is being
shut down.
There doesn't seem to be any rhyme or reason to when pbs_mom
goes down.
How do I go about diagnosing this problem?
Again any and all advice is welcome.
Torque version: 2.1.8
Maui version: 3.2.6p20
I'm running Linux (RHEL 5.0) on an HP XPC cluster.
Let me know if there is any other information that would help in
diagnosing this issue.
Thanks in advance for any and all advice.
Damian
========================================================================
Damian Fermin, Ph.D
dfermin at umich.edu
Pathology Department
University of Michigan
1300 Catherine St.
Ann Arbor, MI 48109
734.615.0302
------------------------------------------------------------------------
"There is no gene for the Human Spirit"
-- GATTACA
------------------------------------------------------------------------