[torqueusers] Cluster Node changing to state "down, job-exclusive"

Garrick Staples garrick at usc.edu
Sun Feb 19 23:59:34 MST 2006


On Fri, Feb 17, 2006 at 06:33:05AM -0600, David McGiven alleged:
> 
> Dear Torque Users,
> 
> I have a queue with only one node. I use this node to run a specific kind
> of job.
> 
> Whenever I submit a job, it gets into the queue, maui tells the mom to
> start running it and it starts.
> 
> First the node is marked as "free", then it is marked as "job-exclusive". I
> ssh to the node and I see the process is running, taking 99% CPU.
> 
> This should be the normal behaviour. Then the weird things start.
> 
> I wait a few seconds/minutes and the state changes to "down,
> job-exclusive". I ssh to the node and I see the process is STILL running,
> taking 99% CPU. The node is ok, the pbs_mom is running and I can contact
> it with momctl from the central server.
> 
> Then I issue a :
> bash# qdel 407
> qdel: Server could not connect to MOM 407.server
> 
> I cannot delete the job. The only solution is doing :
> 
> bash# qmgr
> Max open servers: 4
> Qmgr: set node nodo18 state -= down
> 
> And then :
> bash# qdel 407
> 
> Works
> 
> Does anybody know why torque is behaving like this? Do you know which
> logfiles or tools I should check? (Checking
> /var/spool/PBS/mom_logs/logfile didn't help me diagnose the problem).

What about in the server log file?  Do you have any host ACLs?

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

