[torqueusers] Cluster Node changing to state "down, job-exclusive"

David McGiven david.mcgiven at fusemail.com
Mon Feb 20 04:21:57 MST 2006


Dear Ken,

Sorry for the delay getting back to you. This is what my
/var/spool/PBS/server_logs/log says about that job :

02/20/2006 12:12:05;0004;PBS_Server;Svr;check_nodes;node nodo18 not
detected in 168 seconds, marking node down
02/20/2006 12:12:05;0040;PBS_Server;Req;update_node_state;node nodo18
marked down

Also, I've noticed that "qstat" reports "0" under the "Time Use"
column for that job.

I think I know what is causing the problem. The job changes the system
date to a day in 2005, then it starts doing some simulations.

If I comment out the line of code that changes the system date,
everything works fine: the "Time Use" column reports correct numbers,
the node does not go to the "down, job-exclusive" state, etc.

Could this be the problem? Maybe pbs_mom gets confused in some way when
the date is changed backwards while it is running.
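
To illustrate what I suspect is happening (a sketch of the general
failure mode only, not TORQUE's actual code): if the mom decides when to
send its next status update by looking at the wall clock, setting the
date back a year makes "now minus last update" hugely negative, so no
update goes out for a very long time and the server marks the node down.
A monotonic clock would be immune to this:

/* clockdemo.c -- illustration only, NOT TORQUE source code.
 * Compile: gcc -o clockdemo clockdemo.c -lrt
 * A status loop driven by the wall clock stops firing when the
 * system date is set backwards; a monotonic clock keeps going. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define INTERVAL 45  /* seconds between status updates (example value) */

int main(void)
{
    time_t last_wall = time(NULL);  /* wall clock: jumps with "date -s" */
    struct timespec last_mono, now;

    clock_gettime(CLOCK_MONOTONIC, &last_mono);
    for (;;) {
        /* If the date is set back a year, time(NULL) - last_wall is a
         * large negative number, this test stays false for a year, the
         * server never hears from us, and it marks the node down. */
        if (time(NULL) - last_wall >= INTERVAL) {
            printf("status update (wall clock)\n");
            last_wall = time(NULL);
        }
        /* CLOCK_MONOTONIC never jumps, so this keeps firing on time. */
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_sec - last_mono.tv_sec >= INTERVAL) {
            printf("status update (monotonic)\n");
            last_mono = now;
        }
        sleep(1);
    }
}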

If that is the problem, how can we solve it? The company that used to
support that software suddenly disappeared some months ago (I'm not
going to name names, but you can easily find out on Google), yet we need
to keep using the software. That's why we set the date back to the time
we last bought the license.

We cannot contact them for support either.
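
If the date change really is the trigger, one idea I'm considering is to
fake the date for the simulation process only, instead of changing the
system clock that pbs_mom also sees. A rough sketch of an LD_PRELOAD
shim (assuming the license check reads the date through time() or
gettimeofday(); the one-year offset is just an example value):

/* faketime.c -- rough sketch of an LD_PRELOAD time shim.
 * Build: gcc -shared -fPIC -o faketime.so faketime.c -ldl
 * Run:   LD_PRELOAD=./faketime.so ./simulation
 * Only the preloaded process sees the shifted date; pbs_mom and
 * the rest of the system keep the real clock. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <time.h>
#include <sys/time.h>

#define OFFSET ((time_t)-365 * 24 * 60 * 60)  /* example: one year back */

time_t time(time_t *t)
{
    static time_t (*real_time)(time_t *);
    time_t now;

    if (!real_time)
        real_time = (time_t (*)(time_t *))dlsym(RTLD_NEXT, "time");
    now = real_time(NULL) + OFFSET;
    if (t)
        *t = now;
    return now;
}

int gettimeofday(struct timeval *tv, struct timezone *tz)
{
    static int (*real_gtod)(struct timeval *, struct timezone *);
    int rc;

    if (!real_gtod)
        real_gtod = (int (*)(struct timeval *, struct timezone *))
                    dlsym(RTLD_NEXT, "gettimeofday");
    rc = real_gtod(tv, tz);
    if (rc == 0 && tv)
        tv->tv_sec += OFFSET;
    return rc;
}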

That also serves as a reply to Garrick's message. Thanks, Garrick!

Best Regards,
David McGiven

----- Original Message -----

> Hello David,
>
> What does the server log (/var/spool/PBS/server_logs/...) tell you?
> The time period of interest would be from the last successful status
> to the failure.  If it shows a timeout, my first question would be,
> does this happen for all jobs or just a particular one?  Also, is
> it possible that the job (process) does a very large memory
> allocation/initialization or other form of blocking I/O at about
> the time of the failure?  You do not need to have a high load average
> to cause a communication failure.  Any process that would cause the
> kernel to run uninterruptibly would have the same effect.  I have not
> looked at the source code in quite some time, but, I think that I
> recall that the socket timeout is not set to be very large.  Hope that
> this helps.
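>
> For what it's worth, the kind of short timeout I mean looks roughly
> like the following (an illustrative sketch, not TORQUE's actual code);
> if the mom's host is stuck that long, the server's read just times out:
>
> /* sketch: cap how long a recv() on this socket may block */
> #include <sys/socket.h>
> #include <sys/time.h>
>
> int set_recv_timeout(int sock, int seconds)
> {
>     struct timeval tv;
>
>     tv.tv_sec = seconds;
>     tv.tv_usec = 0;
>     /* After this, a blocked recv() returns -1 with errno set to
>      * EAGAIN/EWOULDBLOCK once the timeout expires. */
>     return setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
> }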
>
> -- Ken Matney, Sr.
>    Technology Integration Group
>    National Center for Computational Sciences
>    Oak Ridge National Laboratory
>
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of David McGiven
> Sent: Fri 2/17/2006 9:22 AM
> To: torqueusers at supercluster.org
> Cc: Jonas_Berlin at harte-hanks.com
> Subject: Re: [torqueusers] Cluster Node changing to state "down,
> job-exclusive"
>
>
> Jonas,
>
> Thanks for your advice. Unfortunately this is not causing the problem.
>
> The load is 1 or 1.1 at maximum.
>
> There are no other processes demanding intensive CPU usage. This node
> only runs one job at a time, and every job is a single process, so the
> load is ~1.
>
> Regards,
>
> David
> CTO
>
> ----- Original Message -----
>
> >
> > What is the load average on your machine once the job has started?
> > Could it be that the node is simply so overloaded that mom doesn't get
> > enough cycles to process server requests?
> >
> > Jonas
> >
> > Jonas Berlin Ph. D.
> > Chief Architect
> > Product & Systems Development
> > Harte-Hanks
> > 25 Linnell Circle
> > Billerica, MA 01821
> > USA
> > Phone +1-978-436-2818
> > Mobile +1-508-361-5921
> > Fax +1-978-439-3940
> > jberlin at hartehanks.com
> >
> > "David McGiven" <david.mcgiven at fusemail.com>
> > Sent by: torqueusers-bounces at supercluster.org
> > 02/17/2006 07:33 AM
> > Please respond to: david.mcgiven at fusemail.com
> > To: torqueusers at supercluster.org
> > Subject: [torqueusers] Cluster Node changing to state "down,
> > job-exclusive"
> >
> > Dear Torque Users,
> >
> > I have a queue with only one node. I use this node to run a specific
> > kind of job.
> >
> > Whenever I submit a job, it gets into the queue, Maui tells the mom to
> > start running it, and it starts.
> >
> > First the node is marked as "free", then it is marked as "job-exclusive".
> > I ssh to the node and see the process running, taking 99% CPU.
> >
> > This should be the normal behaviour. Then the weird things start.
> >
> > I wait a few seconds/minutes and the state changes to "down,
> > job-exclusive". I ssh to the node and see the process is STILL running,
> > taking 99% CPU. The node is fine, pbs_mom is running, and I can contact
> > it with momctl from the central server.
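> >
> > For example, a diagnostic query along these lines still gets a normal
> > response from the mom:
> >
> > bash# momctl -d 3 -h nodo18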
> >
> > Then I issue:
> > bash# qdel 407
> > qdel: Server could not connect to MOM 407.server
> >
> > I cannot delete the job. The only solution is:
> >
> > bash# qmgr
> > Max open servers: 4
> > Qmgr: set node nodo18 state -= down
> >
> > And then:
> > bash# qdel 407
> >
> > works.
> >
> > Does anybody know why TORQUE is behaving like this? Do you know which
> > logfiles or tools I should check? (Checking
> > /var/spool/PBS/mom_logs/logfile didn't help me diagnose the problem.)
> >
> > Thank you very much in advance.
> >
> > Regards,
> >
> > David McGiven
> > CTO
>
>

