[torqueusers] Disappearing Nodes

gianfranco sciacca gs at hep.ucl.ac.uk
Wed Mar 23 09:43:14 MST 2005


I have a similar problem on a newly set up test system (server/mom/sched
+ one mom node). The node shows up with
<state = state-unknown,down> at server+mom+sched startup.

I can mark it free from qmgr <s n nodename state=free>, and it will run
jobs while marked free, but after a few minutes it goes down again. The
scheduler is the built-in one, with the out-of-the-box config.

Server log after server startup is:
===============
03/23/2005 15:58:51;0002;PBS_Server;Svr;Log;Log opened
03/23/2005 15:58:51;0006;PBS_Server;Svr;PBS_Server;Server xx.xx.xx.xx
started, initialization type = 1
03/23/2005 15:58:51;0002;PBS_Server;Svr;Act;Account file
/usr/spool/PBS/server_priv/accounting/20050323 opened
03/23/2005 15:58:51;0040;PBS_Server;Req;setup_nodes;setup_nodes()
 
03/23/2005 15:58:51;0086;PBS_Server;Svr;PBS_Server;Recovered queue
gridshort
03/23/2005 15:58:51;0086;PBS_Server;Svr;PBS_Server;Recovered queue
medium
03/23/2005 15:58:51;0086;PBS_Server;Svr;PBS_Server;Recovered queue long
03/23/2005 15:58:51;0086;PBS_Server;Svr;PBS_Server;Recovered queue short
03/23/2005 15:58:51;0086;PBS_Server;Svr;PBS_Server;Recovered queue bulk
03/23/2005 15:58:51;0086;PBS_Server;Svr;PBS_Server;Recovered queue
gridlong
03/23/2005 15:58:51;0002;PBS_Server;Svr;PBS_Server;Expected 6, recovered
6 queues
03/23/2005 15:58:51;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered
0 jobs
03/23/2005 15:58:51;0006;PBS_Server;Svr;PBS_Server;Using ports
Server:15001  Scheduler:15004  MOM:15002
03/23/2005 15:58:51;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid =
7332
03/23/2005 15:58:51;0040;PBS_Server;Svr;xx.xx.xx.xx;Scheduler sent
command scheduler_first
03/23/2005 15:58:51;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at xx.xx.xx.xx, sock=9
03/23/2005 15:58:51;0100;PBS_Server;Req;;Type StatusNode request
received from Scheduler at xx.xx.xx.xx, sock=9
03/23/2005 15:58:51;0100;PBS_Server;Req;;Type StatusQueue request
received from Scheduler at xx.xx.xx.xx, sock=9
03/23/2005 15:58:51;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at xx.xx.xx.xx, sock=9
=====.....repeated 5 more times.....then:=====
03/23/2005 15:58:56;0040;PBS_Server;Req;is_stat_get;node xx.xx.xx.xx
marked available
===============

xx.xx.xx.xx is the server/mom node. The server never marks the other mom
node as available. Everything else seems OK.

Mom log is:
===============
03/23/2005 16:37:56;0001;   pbs_mom;Svr;pbs_mom;No child processes (10)
in is_update_stat, cannot specify protocol
03/23/2005 16:37:56;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr xx.xx.xx.xx:15001
03/23/2005 16:38:26;0002;   pbs_mom;n/a;is_update_stat;hello sent to
server
===============

The last message is repeated 16 times before the "No child processes"
error reappears.
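
For anyone who wants to tally how often each message recurs in a log like
the one above, here is a minimal Python sketch. It assumes each record is a
single line of six semicolon-separated fields, as in the excerpts; the field
names (timestamp, code, daemon, obj_class, obj_name, message) are my own
labels, not official ones, and parse_record/count_messages are hypothetical
helpers. The sample records are the mom log lines above with the mail
wrapping undone.

```python
from collections import Counter

# Records copied from the mom log above, with the mail-wrapped lines rejoined.
SAMPLE_LOG = [
    "03/23/2005 16:37:56;0001;   pbs_mom;Svr;pbs_mom;"
    "No child processes (10) in is_update_stat, cannot specify protocol",
    "03/23/2005 16:37:56;0001;   pbs_mom;Svr;pbs_mom;"
    "im_eof, Premature end of message from addr xx.xx.xx.xx:15001",
    "03/23/2005 16:38:26;0002;   pbs_mom;n/a;is_update_stat;"
    "hello sent to server",
]

# Field names are assumptions made for this sketch only.
FIELD_NAMES = ("timestamp", "code", "daemon", "obj_class", "obj_name", "message")


def parse_record(line):
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        # Wrapped continuation lines and separators end up here and are skipped.
        return None
    return {name: value.strip() for name, value in zip(FIELD_NAMES, parts)}


def count_messages(lines):
    """Count how many times each message text occurs."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record is not None:
            counts[record["message"]] += 1
    return counts


if __name__ == "__main__":
    for message, n in count_messages(SAMPLE_LOG).most_common():
        print(n, message)
```

Run against the full mom log file instead of SAMPLE_LOG, this should make
the ratio of "hello sent to server" lines to "No child processes" errors
easy to see.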

For the mom node, <momctl -d 4 -h yy.yy.yy.yy> gives:
=====
Host: yy.yy.yy.yy/yy.yy.yy.yy   Server: xx.xx.xx.xx   Version:
torque_1.2.0p1
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             4384 seconds
WARNING:  no messages received from server
Server Update Interval: 20 seconds
Server Update Interval: 20 seconds
WARNING:  no hello/cluster-addrs messages received from server
Init Msgs Sent:         139 hellos
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
TCP Timeout:            20 seconds
Prolog Alarm Time:      300 seconds
Alarm Time:             0 of 10 seconds
=====
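
To pull just the warnings out of diagnostic output like the above, a small
Python sketch along these lines may help. It assumes the text has already
been captured into a string and that each line of interest has the form
"Key: value"; summarize_diag is a hypothetical helper, and the sample text
is a subset of the output above (the multi-key Host/Server/Version line is
left out because this simple split would not handle it).

```python
# A subset of the diagnostic output above, used as sample input.
SAMPLE_DIAG = """\
MOM active:             4384 seconds
WARNING:  no messages received from server
Server Update Interval: 20 seconds
WARNING:  no hello/cluster-addrs messages received from server
Init Msgs Sent:         139 hellos
Communication Model:    RPP
"""


def summarize_diag(text):
    """Return (fields, warnings) parsed from 'Key: value' lines."""
    fields = {}
    warnings = []
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # no colon on this line, e.g. a '=====' separator
        if key == "WARNING":
            warnings.append(value.strip())
        else:
            fields[key.strip()] = value.strip()
    return fields, warnings


if __name__ == "__main__":
    fields, warnings = summarize_diag(SAMPLE_DIAG)
    for w in warnings:
        print("WARNING:", w)
    print("Init Msgs Sent:", fields.get("Init Msgs Sent"))
```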

I'd be grateful for any expertise in tackling this problem.

best regards,
gianfranco

On Wed, 2005-03-23 at 15:04, Jeremy Stout wrote:
> Hello. Over the weekend, I noticed that the nodes on my cluster would
> disappear and come back every few minutes. When they would disappear,
> the status would often appear as "down". I've looked at the server/
> mom logs and have not been able to figure out what the problem is.
<snip>


