[torqueusers] Nodes show up as down
Prakash Velayutham
prakash.velayutham at cchmc.org
Mon Jun 9 14:20:40 MDT 2008
Hi,
I am having trouble with the Torque server and MOM in a test setup.
I have tested 2.3.0 and 2.4.0 and both show the same issue, so I
think it is a configuration problem rather than a bug. The only node
shows up as down in "pbsnodes", and I cannot tell why.
This is what I see:
MOM side:
bmiwebd1:~ # /usr/local/torque-2.4.0/sbin/momctl -d 3
Host: bmiwebd1/bmiwebd1.cluster.cchmc.org Version: 2.4.0-snap.200804241119 PID: 10566
Server[0]: bmiclustersvc1.cchmc.org (205.142.199.238:15001)
WARNING: no hello/cluster-addrs messages received from server
Init Msgs Sent: 1 hellos
WARNING: no messages received from server
Last Msg To Server: 16 seconds
HomeDirectory: /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (9789490 blocks available)
NOTE: syslog enabled
HomeDirectory: /var/spool/torque/mom_priv
MOM active: 16 seconds
Server Update Interval: 45 seconds
LogLevel: 9 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
Prolog: /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 205.142.199.238,192.168.1.14,127.0.0.1
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
diagnostics complete
Server logs:
06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Server bmiclustersvc1.cchmc.org started, initialization type = 1
06/09/2008 16:13:55;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20080609 opened
06/09/2008 16:13:55;0040;PBS_Server;Req;setup_nodes;setup_nodes()
06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002
06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 6734, loglevel=0
06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;successful ping to node bmiwebd1 (stream 1)
06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to bmiwebd1 is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message). setting node state to down
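
For anyone sifting through a longer server log for the same symptom, here is a minimal Python sketch that splits each record on its semicolons and picks out the ones where a node is set to down. It assumes every record sits on a single line, as in the log file itself. The field names (timestamp, event_code, and so on) are my own labels for the six semicolon-separated fields seen above, not names taken from Torque, and the helper functions are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels (an assumption), one per
    # semicolon-separated field in the server log records.
    timestamp: str
    event_code: str
    server: str
    object_type: str
    object_name: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None if it does not have six fields."""
    # Split on the first five semicolons only, so any semicolon inside
    # the free-text message is preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def nodes_marked_down(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, message) for records that set a node to down."""
    for line in lines:
        rec = parse_record(line)
        if rec is not None and "setting node state to down" in rec.message:
            yield rec.timestamp, rec.message


# Two records copied from the server log in this post.
sample = [
    "06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;"
    "successful ping to node bmiwebd1 (stream 1)",
    "06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;"
    "stream_eof, connection to bmiwebd1 is bad, remote service may be down, "
    "message may be corrupt, or connection may have been dropped remotely "
    "(Premature end of message). setting node state to down",
]

for ts, msg in nodes_marked_down(sample):
    # Print the timestamp and the first clause of the message.
    print(ts, msg.split(",")[0])
```

To run it against a real log, pass an open file object to nodes_marked_down() in place of the sample list. This only locates the records; it does not explain why the stream was dropped.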
Thanks,
Prakash
Prakash Velayutham
Programmer / Analyst
Cincinnati Children's Hospital Medical Center