[torqueusers] nodes hung (or not?)

Sam Rash srash at yahoo-inc.com
Wed Sep 20 11:39:37 MDT 2006

We have a small cluster of about 24 nodes, and we keep it packed about 8
hours a day.  I happened to check qstat -Q and found that only half as many
jobs were in the running state as there should have been.

For example, about 3000 jobs were in the queued state and eligible to run.
This smelled of some nodes being down or something, so I checked each node
with pbsnodes -a.  They looked fine (even the 'empty' ones), entirely good
to schedule on.  Using maui's checknode and checkjob, nothing was out of the
ordinary (normal messages for jobs indicating eligible and ready to schedule).
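
For anyone who wants to repeat the node check quickly, here is a minimal
sketch of how it could be scripted. It assumes pbsnodes -a prints each node
name unindented, followed by indented "key = value" lines, with a "jobs"
line present only when the node is running something. The function names
are made up for illustration, and none of this was part of the original
troubleshooting.

```python
import subprocess


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}.

    Assumed layout: an unindented node name, then indented `key = value`
    lines, with a blank line between nodes.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current node's block.
            current = None
            continue
        if not line[0].isspace():
            # An unindented line starts a new node entry.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Split on the first '=' only; values may contain '=' themselves.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def idle_free_nodes(nodes):
    """Return sorted names of nodes reported as free with no jobs listed."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if attrs.get("state") == "free" and not attrs.get("jobs")
    )


def run_pbsnodes():
    """Run `pbsnodes -a` and return its output (requires TORQUE client tools)."""
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling idle_free_nodes(parse_pbsnodes(run_pbsnodes())) while thousands of
jobs sit queued would list the nodes that look schedulable but are running
nothing, which is the symptom described above.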


I didn't have time to investigate (already 5+ hours behind on production), so
I just restarted the nodes on the suspect boxes (about 15), and things went
along well after that.


Has anyone seen this?  This is the first time it has happened in over 10
months of using torque (and 3 with maui).  If it happens again, hopefully I
can check more logs and get a better idea.

Any help is greatly appreciated.


Sam Rash

srash at yahoo-inc.com



