[torqueusers] Fwd: Failover takes long time in HA mode

Clotho Tsang wytsang at clustertech.com
Tue Sep 17 20:00:09 MDT 2013


Torque in our cluster is set up in HA mode (pbs_server is run with the --ha
option).
Failover takes over 3 minutes when the node running the active pbs_server
goes down.
Please take a look at the pbs_server log below:
19:20 is when the other pbs_server died.
19:23 is when pbs_server on this node became active (the point at which
qstat starts returning output).


09/07/2013 19:20:15;0002;PBS_Server.7333;Svr;Log;Log opened
09/07/2013 19:20:15;0006;PBS_Server.7333;Svr;PBS_Server;Server master.localdomain started, initialization type = 1
09/07/2013 19:20:15;0002;PBS_Server.7333;Svr;get_default_threads;Defaulting min_threads to 33 threads
09/07/2013 19:20:15;0002;PBS_Server.7333;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20130907 opened
09/07/2013 19:20:15;0040;PBS_Server.7333;Req;setup_nodes;setup_nodes()
09/07/2013 19:20:15;0086;PBS_Server.7333;Svr;PBS_Server;Recovered queue prepost_q
09/07/2013 19:20:15;0086;PBS_Server.7333;Svr;PBS_Server;Recovered queue prepost_q_high
09/07/2013 19:20:15;0086;PBS_Server.7333;Svr;PBS_Server;Recovered queue model_q_high
09/07/2013 19:20:15;0086;PBS_Server.7333;Svr;PBS_Server;Recovered queue model_q
09/07/2013 19:20:15;0086;PBS_Server.7333;Svr;PBS_Server;Recovered queue batch
09/07/2013 19:20:15;0002;PBS_Server.7333;Svr;PBS_Server;Expected 5, recovered 5 queues
09/07/2013 19:21:07;0080;PBS_Server.7333;Svr;PBS_Server;1000 files read from disk
09/07/2013 19:22:00;0080;PBS_Server.7333;Svr;PBS_Server;2000 files read from disk
09/07/2013 19:22:53;0080;PBS_Server.7333;Svr;PBS_Server;3000 files read from disk
09/07/2013 19:23:48;0080;PBS_Server.7333;Svr;PBS_Server;4000 files read from disk
09/07/2013 19:23:51;0080;PBS_Server.7333;Svr;PBS_Server;4038 total files read from disk
09/07/2013 19:23:51;0100;PBS_Server.7333;Job;10448.master.localdomain;enqueuing into model_q, state 6 hop 1
09/07/2013 19:23:51;0086;PBS_Server.7333;Job;10448.master.localdomain;Requeueing job, substate: 59 Requeued in queue: model_q
09/07/2013 19:23:51;0100;PBS_Server.7333;Job;10881.master.localdomain;enqueuing into prepost_q_high, state 2 hop 1
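
To put a number on the delay, the "files read from disk" progress lines in the log above can be turned into a per-interval read rate. The following is only a rough sketch: the regular expression, the helper names (read_progress, deltas) and the choice of 19:20:15 as the starting point (the last line logged before file reading begins) are assumptions made for illustration, not part of Torque.

```python
import re
from datetime import datetime

# Progress lines copied from the log above, one entry per line.
LOG = """\
09/07/2013 19:20:15;0002;PBS_Server.7333;Svr;PBS_Server;Expected 5, recovered 5 queues
09/07/2013 19:21:07;0080;PBS_Server.7333;Svr;PBS_Server;1000 files read from disk
09/07/2013 19:22:00;0080;PBS_Server.7333;Svr;PBS_Server;2000 files read from disk
09/07/2013 19:22:53;0080;PBS_Server.7333;Svr;PBS_Server;3000 files read from disk
09/07/2013 19:23:48;0080;PBS_Server.7333;Svr;PBS_Server;4000 files read from disk
09/07/2013 19:23:51;0080;PBS_Server.7333;Svr;PBS_Server;4038 total files read from disk
"""

# Only the timestamp and the message text are inspected; the fields
# in between are skipped.
PROGRESS = re.compile(
    r"^(\d\d/\d\d/\d{4} \d\d:\d\d:\d\d);.*;(\d+) (?:total )?files read from disk$"
)


def read_progress(log_text):
    """Return (timestamp, cumulative_files_read) pairs from progress lines."""
    points = []
    for line in log_text.splitlines():
        m = PROGRESS.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
            points.append((ts, int(m.group(2))))
    return points


def deltas(points, start):
    """Return (files, seconds) for each interval, counting from `start`."""
    out = []
    prev_ts, prev_n = start, 0
    for ts, n in points:
        out.append((n - prev_n, (ts - prev_ts).total_seconds()))
        prev_ts, prev_n = ts, n
    return out


# Assumption: file reading starts right after the queues are recovered.
START = datetime(2013, 9, 7, 19, 20, 15)

points = read_progress(LOG)
steps = deltas(points, START)
for files, seconds in steps:
    print(f"{files:5d} files in {seconds:5.0f} s")

total_files = points[-1][1]
total_seconds = (points[-1][0] - START).total_seconds()
print(f"{total_files} files in {total_seconds:.0f} s, "
      f"{total_seconds / total_files * 1000:.1f} ms per file")
```

For this log, each block of 1000 files takes between 52 and 55 seconds, so within this single startup the read time is roughly proportional to the number of files (about 53 ms per file).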

We also found that as the job history grows, failover takes much longer
(the time appears to grow exponentially).

Apparently pbs_server reads thousands of files before becoming active. Where
are these files located? Is there any way to reduce the failover time?

Thanks very much.