[torqueusers] Torque changes node state frequently

Danny Sternkopf dsternkopf at hpce.nec.com
Tue Aug 15 06:37:16 MDT 2006


Hi,

We updated our 200-node cluster to Torque version 2.1.0p0. (I know it 
is a bit outdated by now.)

I can see that Torque is changing the node state from free/job-exclusive 
to down and, one minute later, back to the original state.
This happens on all the nodes every 5-10 minutes.
The scheduler (Maui) doesn't like it when all resources are gone, and it 
blocks all the queued jobs.

Here is an example from the pbs_server log:
08/15/2006 14:23:16;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:23:16;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:24:02;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:24:02;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:24:02;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:24:48;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:24:48;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:24:48;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:25:34;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:25:34;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:25:34;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;attributes set:  at 
request of root@cacau1.nec
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;node noco120.nec 
state changed from job-exclusive to down,job-exclusive
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;attributes set: 
state - offline
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;attributes set: 
state + down
08/15/2006 14:26:20;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:26:20;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:26:20;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:26:20;0040;PBS_Server;Req;update_node_state;node 
noco120.nec marked free
08/15/2006 14:27:06;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:27:06;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:27:06;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:27:52;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:27:52;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:27:52;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:28:38;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:28:38;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:28:38;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:29:24;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:29:24;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:29:24;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:30:10;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:30:10;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:30:10;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:30:56;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:30:56;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:30:56;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec
08/15/2006 14:31:42;0004;PBS_Server;Svr;is_request;message STATUS (4) 
received from mom on host noco120.nec (172.16.9.120:1023)
08/15/2006 14:31:42;0004;PBS_Server;Svr;is_request;IS_STATUS received 
from noco120.nec
08/15/2006 14:31:42;0040;PBS_Server;Req;is_stat_get;received status from 
node noco120.nec

(As I said, the same happens with free nodes too!)
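
To pull only the interesting records out of a log like the one above, something along these lines may help. It is a minimal sketch, not part of Torque: the function names (join_wrapped, parse, state_changes, status_gaps) are made up for illustration, the six-field split is inferred from the excerpt itself, and the embedded sample is a shortened copy of four records from this post. It re-joins records that were wrapped by the mail client, lists the state changes, and measures the gap between successive status reports.

```python
import re
from datetime import datetime

# A record starts with a timestamp such as "08/15/2006 14:23:16;".
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# Shortened sample taken from the excerpt in this post (still line-wrapped).
SAMPLE = """\
08/15/2006 14:25:34;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:25:53;0004;PBS_Server;node;noco120.nec;node noco120.nec
state changed from job-exclusive to down,job-exclusive
08/15/2006 14:26:20;0040;PBS_Server;Req;is_stat_get;received status from
node noco120.nec
08/15/2006 14:26:20;0040;PBS_Server;Req;update_node_state;node
noco120.nec marked free
"""


def join_wrapped(lines):
    """Re-join records that were wrapped onto a continuation line."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation line: glue it to the previous record with one space.
            sep = "" if records[-1].endswith(" ") else " "
            records[-1] += sep + line.lstrip()
    return records


def parse(record):
    """Split a record into (time, code, server, kind, name, message).

    The field names are my own labels, inferred from the excerpt.
    """
    stamp, code, server, kind, name, msg = record.split(";", 5)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return when, code, server, kind, name, msg


def state_changes(records):
    """Return (time, message) for records that report a node state change."""
    hits = []
    for rec in records:
        when, _code, _server, _kind, _name, msg = parse(rec)
        if "state changed from" in msg or "marked free" in msg:
            hits.append((when, msg))
    return hits


def status_gaps(records):
    """Return seconds between successive is_stat_get records."""
    stamps = [p[0] for p in map(parse, records) if p[4] == "is_stat_get"]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    recs = join_wrapped(SAMPLE.splitlines())
    for when, msg in state_changes(recs):
        print(when.time(), msg)
    print("status gaps (s):", status_gaps(recs))
```

For a real run, the sample would be replaced by the lines of the server log file, filtered to a single node first so that the gaps are per node.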

What could be the cause of that behavior?

PBS server attributes:
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 10
set server job_stat_rate = 300
set server poll_jobs = True

PBS MOM parameters:
$timeout 600
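
For comparing these settings with the timing seen in the log, the "set server" lines can be loaded into a dictionary. This is only a sketch under stated assumptions: parse_server_attrs is a made-up helper, the 46-second figure is the spacing between status reports in the excerpt above (14:23:16, 14:24:02, 14:24:48, ...), and the comparison assumes node_check_rate is the number of seconds a node may stay silent before the server marks it down, which is my reading of the documentation and not something this post confirms.

```python
import re

# The server attributes exactly as listed in this post.
SERVER_SETTINGS = """\
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 10
set server job_stat_rate = 300
set server poll_jobs = True
"""

SET_LINE = re.compile(r"^set server (\w+) = (.+)$")


def parse_server_attrs(text):
    """Turn 'set server name = value' lines into a dict (hypothetical helper)."""
    attrs = {}
    for line in text.splitlines():
        match = SET_LINE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a 'set server' line
        name, raw = match.groups()
        # Keep integers as int, everything else as the raw string.
        attrs[name] = int(raw) if raw.isdigit() else raw
    return attrs


attrs = parse_server_attrs(SERVER_SETTINGS)

# Spacing between status reports observed in the log excerpt, in seconds.
OBSERVED_STATUS_GAP = 46

# Assumption: a node is only timed out once it has been silent for longer
# than node_check_rate seconds.
silent_too_long = OBSERVED_STATUS_GAP > attrs["node_check_rate"]

if __name__ == "__main__":
    print("node_check_rate:", attrs["node_check_rate"])
    print("status gap exceeds node_check_rate:", silent_too_long)
```

Under that assumption the status reports arrive far more often than the timeout requires, which is why the down transitions are surprising.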

Thank you for your help, and best regards,

Danny

