[torqueusers] Cluster Node changing to state "down, job-exclusive"

David McGiven david.mcgiven at fusemail.com
Fri Feb 17 07:22:23 MST 2006


Jonas,

Thanks for your advice. Unfortunately, that is not what is causing the
problem.

The load is 1, or 1.1 at most.

There are no other CPU-intensive processes. This node only runs one job at
a time, and every job is a single process, so the load is ~1.

Regards,

David
CTO

----- Original Message -----

>
>
> What is the load average on your machine once the job has started?
> Could it be that the node is simply so overloaded that mom doesn't get
> enough cycles to process server requests?
>
> Jonas
>
> Jonas Berlin Ph. D.
> Chief Architect
> Product & Systems Development
> Harte-Hanks
> 25 Linnell Circle
> Billerica, MA 01821
> USA
> Phone +1-978-436-2818
> Mobile +1-508-361-5921
> Fax +1-978-439-3940
> jberlin at hartehanks.com
>
> "David McGiven" <david.mcgiven at fusemail.com>
> Sent by: torqueusers-bounces at supercluster.org
> 02/17/2006 07:33 AM
> Please respond to: david.mcgiven at fusemail.com
> To: torqueusers at supercluster.org
> cc:
> Subject: [torqueusers] Cluster Node changing to state "down,job-exclusive"
>
> Dear Torque Users,
>
> I have a queue with only one node. I use this node to run a specific kind
> of job.
>
> Whenever I submit a job, it gets into the queue, maui tells the mom to
> start running it and it starts.
>
> First the node is marked as "free", then it is marked as "job-exclusive".
> I ssh to the node and see the process running, taking 99% CPU.
>
> This should be the normal behaviour. Then the weird things start.
>
> I wait a few seconds/minutes and the state changes to "down,
> job-exclusive". I ssh to the node and I see the process is STILL running,
> taking 99% CPU. The node is ok, the pbs_mom is running and I can contact
> it with momctl from the central server.
>
> Then I issue:
> bash# qdel 407
> qdel: Server could not connect to MOM 407.server
>
> I cannot delete the job. The only workaround is:
>
> bash# qmgr
> Max open servers: 4
> Qmgr: set node nodo18 state -= down
>
> And then :
> bash# qdel 407
>
> Works
>
> Does anybody know why torque is behaving like this? Do you know which
> logfiles or tools I should check? (Checking
> /var/spool/PBS/mom_logs/logfile didn't help me diagnose the problem.)
>
> Thank you very much in advance.
>
> Regards,
>
> David McGiven
> CTO
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
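
To spot nodes stuck in the state described above, a small shell helper can
filter the output of `pbsnodes -a`. This is only a sketch under assumptions:
the function name stuck_nodes is hypothetical, and it assumes the usual
pbsnodes layout in which each node name starts in the first column and its
attributes follow on indented "state = ..." lines.

```shell
# Hypothetical helper: read `pbsnodes -a` style text on stdin and print the
# name of every node whose state line contains both "down" and
# "job-exclusive".
stuck_nodes() {
    awk '
        # An unindented, non-empty line is assumed to be a node name.
        /^[^ \t]/ { node = $1 }
        # An indented "state = ..." line carries the comma-separated states.
        /^[ \t]+state[ \t]*=/ {
            if ($0 ~ /down/ && $0 ~ /job-exclusive/) print node
        }
    '
}
```

On the server this could be run as `pbsnodes -a | stuck_nodes`. For each node
it prints, the workaround from the message above (clearing the "down" state
in qmgr, then running qdel) can be applied by hand.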



