[torqueusers] Cluster Node changing to state "down, job-exclusive"
Garrick Staples
garrick at usc.edu
Sun Feb 19 23:59:34 MST 2006
On Fri, Feb 17, 2006 at 06:33:05AM -0600, David McGiven alleged:
>
> Dear Torque Users,
>
> I have a queue with only one node. I use this node to run a specific kind
> of job.
>
> Whenever I submit a job, it enters the queue, Maui tells the MOM to
> start it, and it starts running.
>
> First the node is marked "free", then "job-exclusive". I ssh to the
> node and see the process running at 99% CPU.
>
> This is the expected behaviour. Then things get weird.
>
> I wait a few seconds or minutes and the state changes to "down,
> job-exclusive". I ssh to the node and see the process is STILL running
> at 99% CPU. The node is OK, pbs_mom is running, and I can contact it
> with momctl from the central server.
>
> Then I issue:
> bash# qdel 407
> qdel: Server could not connect to MOM 407.server
>
> I cannot delete the job. The only workaround is:
>
> bash# qmgr
> Max open servers: 4
> Qmgr: set node nodo18 state -= down
>
> And then:
> bash# qdel 407
>
> works.
>
> Does anybody know why Torque is behaving like this? Which logfiles or
> tools should I check? (Checking /var/spool/PBS/mom_logs/logfile didn't
> help me diagnose the problem.)
What does the server log file show? Do you have any host ACLs set?
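
A rough sketch of those two checks, to be run on the server host. It is
untested against your setup. PBS_HOME is only a guess based on the mom_logs
path you quoted, the node name is taken from your qmgr line, and the two
helper functions are made up for illustration:

```shell
#!/bin/sh
# Sketch only. PBS_HOME and NODE are assumptions taken from the path and
# node name quoted above; adjust them for the local install.
PBS_HOME=${PBS_HOME:-/var/spool/PBS}
NODE=${NODE:-nodo18}

# Hypothetical helper: print the lines of a log file that mention a node.
# $1 = log file, $2 = node name
node_log_lines() {
    grep -F -- "$2" "$1"
}

# Hypothetical helper: keep only the ACL-related lines (acl_host_enable,
# acl_hosts, ...) from server configuration text read on stdin.
acl_lines() {
    grep -i 'acl_'
}

# Only touch the live server if qmgr is actually installed on this host.
if command -v qmgr >/dev/null 2>&1; then
    # Dump the server configuration and show any ACL settings.
    qmgr -c 'print server' | acl_lines

    # Server logs are normally one file per day, named YYYYMMDD.
    LOG="$PBS_HOME/server_logs/$(date +%Y%m%d)"
    if [ -r "$LOG" ]; then
        node_log_lines "$LOG" "$NODE" | tail -n 50
    fi
fi
```

If the first command prints acl_host_enable = True, check that the names in
acl_hosts match what the node's hostname actually resolves to. The log lines
around the time the node flips to "down" should say why the server gave up
on the MOM.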
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California