[torqueusers] Nodes show up as down

Prakash Velayutham prakash.velayutham at cchmc.org
Mon Jun 9 14:20:40 MDT 2008


Hi,

I am having trouble with a Torque server and MOM running in a test
setup. I have tested 2.3.0 and 2.4.0 and both show the same issue,
so I think it is a configuration problem rather than a bug.

The only node in the setup shows up as down in "pbsnodes", and I cannot tell why.

This is what I notice:

MOM side:

bmiwebd1:~ # /usr/local/torque-2.4.0/sbin/momctl -d 3

Host: bmiwebd1/bmiwebd1.cluster.cchmc.org   Version: 2.4.0-snap.200804241119   PID: 10566
Server[0]: bmiclustersvc1.cchmc.org (205.142.199.238:15001)
   WARNING:  no hello/cluster-addrs messages received from server
   Init Msgs Sent:         1 hellos
   WARNING:  no messages received from server
   Last Msg To Server:     16 seconds
HomeDirectory:          /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (9789490 blocks available)
NOTE:  syslog enabled
HomeDirectory:          /var/spool/torque/mom_priv
MOM active:             16 seconds
Server Update Interval: 45 seconds
LogLevel:               9 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:    205.142.199.238,192.168.1.14,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete


Server logs:

06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Server bmiclustersvc1.cchmc.org started, initialization type = 1
06/09/2008 16:13:55;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20080609 opened
06/09/2008 16:13:55;0040;PBS_Server;Req;setup_nodes;setup_nodes()
06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002
06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 6734, loglevel=0
06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;successful ping to node bmiwebd1 (stream 1)
06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to bmiwebd1 is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message).  setting node state to down
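
In case it helps anyone pull the relevant entries out of a longer server
log, here is a minimal Python sketch. It assumes every record has six
semicolon-separated fields, as in the excerpt above. The field names
(timestamp, code, daemon, obj_type, obj_name, message) are my own labels
for illustration, not names taken from Torque, and the sample lines are
copied from the log above.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not official Torque terminology.
    timestamp: datetime
    code: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        ts = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(ts, *parts[1:])


def node_down_events(lines: Iterable[str]) -> Iterator[LogRecord]:
    """Yield records whose message says a node was set to down."""
    for line in lines:
        rec = parse_line(line)
        if rec is not None and "setting node state to down" in rec.message:
            yield rec


# Sample lines copied from the server log excerpt above.
SAMPLE = [
    "06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;"
    "successful ping to node bmiwebd1 (stream 1)",
    "06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;"
    "stream_eof, connection to bmiwebd1 is bad, remote service may be down, "
    "message may be corrupt, or connection may have been dropped remotely "
    "(Premature end of message).  setting node state to down",
]

if __name__ == "__main__":
    for rec in node_down_events(SAMPLE):
        print(rec.timestamp.isoformat(), rec.code, rec.message)
```

To run it against a real log, pass an open file object to
node_down_events() instead of SAMPLE.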

Thanks,
Prakash

Prakash Velayutham
Programmer / Analyst
Cincinnati Children's Hospital Medical Center


