[torqueusers] Torque changes node state frequently
Danny Sternkopf
dsternkopf at hpce.nec.com
Tue Aug 15 06:37:16 MDT 2006
Hi,
We updated our 200-node cluster to Torque version 2.1.0p0. (I know it
is a bit outdated by now.)
I can see that Torque is changing the node state from free/job-exclusive
to down and, one minute later, back to the original state.
This happens with all the nodes every 5-10 minutes.
The scheduler (Maui) does not cope well when all resources disappear,
and it blocks all the queued jobs.
Here is an example:
08/15/2006 14:23:16;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:23:16;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:24:02;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:24:02;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:24:02;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:24:48;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:24:48;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:24:48;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:25:34;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:25:34;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:25:34;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;attributes set: at
request of root at cacau1.nec
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;node noco120.nec
state changed from job-exclusive to down,job-exclusive
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;attributes set:
state - offline
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;attributes set:
state + down
08/15/2006 14:26:20;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:26:20;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:26:20;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:26:20;0040;PBS_Server;Req;update_node_state;node
noco120.nec marked free
08/15/2006 14:27:06;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:27:06;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:27:06;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:27:52;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:27:52;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:27:52;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:28:38;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:28:38;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:28:38;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:29:24;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:29:24;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:29:24;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:30:10;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:30:10;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:30:10;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:30:56;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:30:56;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:30:56;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:31:42;0004;PBS_Server;Svr;is_request;message STATUS (4)
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:31:42;0004;PBS_Server;Svr;is_request;IS_STATUS received
from noco120.nec
08/15/2006 14:31:42;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
(As I said, the same happens with free nodes!)
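To make the down/free flips easier to spot among the status messages, a small script can pull only the state-change records out of the server log. The following is a minimal sketch, not a Torque tool: the function names parse_record and state_events are made up for illustration, and it assumes each record sits on one line in the semicolon-separated layout seen above (the excerpt in this mail is wrapped by the mail client). The three sample records are copied from the excerpt and re-joined.

```python
from datetime import datetime

def parse_record(line):
    """Split one pbs_server log record into (timestamp, message).

    Assumes the semicolon-separated layout seen in the excerpt:
    date time;code;daemon;class;name;message
    """
    parts = line.strip().split(";", 5)
    if len(parts) < 6:
        return None  # not a complete record (e.g. a wrapped fragment)
    stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    return stamp, parts[5]

def state_events(lines, node):
    """Return (timestamp, message) pairs where the server changed the node's state."""
    events = []
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        stamp, message = record
        if node in message and ("state changed" in message or "marked free" in message):
            events.append((stamp, message))
    return events

# Records taken from the excerpt above, with the mail line wrapping undone.
sample = [
    "08/15/2006 14:25:34;0040;PBS_Server;Req;is_stat_get;received status from node noco120.nec",
    "08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;node noco120.nec state changed from job-exclusive to down,job-exclusive",
    "08/15/2006 14:26:20;0040;PBS_Server;Req;update_node_state;node noco120.nec marked free",
]

events = state_events(sample, "noco120.nec")
for stamp, message in events:
    print(stamp.time(), message)

# Time between the node being marked down and being marked free again.
downtime = (events[1][0] - events[0][0]).total_seconds()
print("seconds marked down:", downtime)
```

Run against the full server log instead of the sample list (for instance by passing an open file object as lines), this would list every flip for a node, which makes it easier to compare the flip times with the 46-second status interval visible in the excerpt.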
What could be the cause of that behavior?
PBS server attributes:
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 10
set server job_stat_rate = 300
set server poll_jobs = True
PBS MOM parameters:
$timeout 600
Thank you for your help and Best regards,
Danny