[torqueusers] nodes hung (or not?)

Garrick Staples garrick at clusterresources.com
Wed Sep 20 17:37:00 MDT 2006

On Wed, Sep 20, 2006 at 10:39:37AM -0700, Sam Rash alleged:
> So we have a small cluster of about 24 nodes.  We keep it packed about 8
> hours a day.  I happened to check qstat -Q and find that only half the jobs
> were in run state as should be.
> Example: about 3000 jobs were in the queued state and eligible to run.  This
> smelled of some nodes being down or something, so I ran pbsnodes -a on each
> node.  They looked fine (even the 'empty ones'), entirely good to schedule
> on.  Using maui checknode and checkjob, nothing was out of the ordinary
> (normal messages for jobs indicating eligible and ready to schedule).

Nodes were "free" in TORQUE and jobs weren't "deferred" in maui, and
nodes were sitting idle?
If the nodes appeared idle, and maui wasn't scheduling jobs, that would
imply a problem on the maui end of things, but ...
> I didn't have time to investigate (already 5+ hrs behind on production), so
> I just restarted the nodes on the suspect boxes (about 15), and things went
> along well.

Restarting pbs_mom on nodes fixed it?  That implies a problem on the
TORQUE end of things.

> Has anyone seen this?  This is the first time in over 10 months of use with
> torque (and 3 with maui).  If it happens again hopefully I can check more
> logs and get a better idea.
> Any help is greatly appreciated.

I can't think of a scenario that fits this description.
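
If it does happen again, it may help to record which nodes TORQUE reports
as "free" with no jobs assigned before restarting anything. The following
is only a rough sketch in Python. It assumes the usual `pbsnodes -a` text
layout (an unindented node name followed by indented `key = value` lines);
the function name `idle_free_nodes` and the sample node names are made up
for illustration and are not part of TORQUE.

```python
def idle_free_nodes(pbsnodes_text):
    """Return names of nodes whose state is exactly 'free' and that list no jobs.

    Assumes `pbsnodes -a` style text: an unindented node name, then
    indented 'key = value' attribute lines, records separated by blank lines.
    """
    nodes = {}
    current = None
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            # A blank line ends the current node record.
            current = None
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Split on the first '=' only; values may contain more of them.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return [name for name, attrs in nodes.items()
            if attrs.get("state") == "free" and not attrs.get("jobs")]


# Hypothetical sample text, not real output from any cluster.
sample = """node01
     state = free
     np = 2
     jobs = 0/1234.server

node02
     state = free
     np = 2

node03
     state = down,offline
     np = 2
"""

print(idle_free_nodes(sample))  # ['node02']
```

Saving that list next to the maui checknode output for the same nodes
would show whether the two sides disagree about a node at the time of the
stall.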

