[torqueusers] nodes down and up because of lost connection
Arnau Bria
arnaubria at pic.es
Wed May 27 05:19:07 MDT 2009
Hi all,
I've added a few new nodes to my cluster, and now I see many messages like:
May 27 12:50:58 pbs02 PBS_Server: stream_eof, connection to td071.pic.es is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (End of File). setting node state to down
coming from ALL hosts.
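To check whether the messages really are spread evenly over all hosts, they could be tallied per node. This is only a minimal sketch, assuming the syslog lines have the same shape as the one above; the function name count_stream_eof and the log path in the usage comment are assumptions, not part of TORQUE.

```shell
# Tally "stream_eof" messages per node from syslog-style lines on stdin.
# count_stream_eof is a hypothetical helper name, not a TORQUE tool.
count_stream_eof() {
    # Keep only the node name from each matching line, then count occurrences.
    sed -n 's/.*stream_eof, connection to \([^ ]*\) is bad.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (the log path is an assumption; use wherever syslog writes on the server):
#   count_stream_eof < /var/log/messages
```

A node that shows up far more often than the others would point at that node or its network path; a flat distribution would point at the server side.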
And nodes are marked as down randomly:
[root@pbs02 ~]# pbsnodes -l
[root@pbs02 ~]# pbsnodes -l
td163.pic.es down
[root@pbs02 ~]# pbsnodes -l
td051.pic.es down
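The flapping could be recorded over time instead of re-running the command by hand. This is a rough sketch, assuming `pbsnodes -l` prints one "node state" pair per line as shown above; snapshot_down_nodes and the PBSNODES override variable are hypothetical names introduced only for illustration.

```shell
# Print one timestamped line per node currently listed by `pbsnodes -l`.
# PBSNODES is a hypothetical override so the sketch can be tried off-cluster.
snapshot_down_nodes() {
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    ${PBSNODES:-pbsnodes} -l | while read -r node state; do
        # Skip blank lines; emit "timestamp node state" for the rest.
        if [ -n "$node" ]; then
            echo "$ts $node $state"
        fi
    done
}

# Usage (interval and output file are assumptions):
#   while true; do snapshot_down_nodes >> /tmp/down_nodes.log; sleep 30; done
```

The resulting log can be lined up against the stream_eof timestamps in syslog to see how long each node stays down.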
If I remove the new nodes, the messages still appear, so the new nodes
themselves are not the cause.
We have 190 nodes:
# wc -l /var/spool/pbs/server_priv/nodes
190 /var/spool/pbs/server_priv/nodes
I've asked the network people whether there is any network problem and they
say no, so I don't know where this problem could come from.
I've restarted pbs_server and pbs_moms.
And if I go to one of the nodes and run momctl:
# momctl -d 3
Host: td112.pic.es/td112.pic.es Version: 2.3.0-snap.200801151629 PID: 24696
Server[0]: pbs02.pic.es (193.109.174.37)
Init Msgs Received: 0 hellos/1 cluster-addrs
Init Msgs Sent: 1 hellos
Last Msg From Server: 24 seconds (StatusJob)
Last Msg To Server: 36 seconds
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (318554 blocks available)
NOTE: syslog enabled
MOM active: 39 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 193.109.173.6,193.109.173.26,193.109.173.27,193.109.173.56,193.109.173.66,193.109.173.69,193.109.173.68,193.109.173.67,193.109.173.65,193.109.173.64,193.109.173.63,193.109.173.62,193.109.173.61,193.109.173.60,193.109.173.111,193.109.173.110,193.109.173.109,193.109.173.108,193.109.173.107,193.109.173.106,193.109.173.105,193.109.173.104,193.109.173.103,193.109.173.102,193.109.173.101,193.109.173.100,193.109.173.99,193.109.173.98,193.109.173.97,193.109.173.96,193.109.173.95,193.109.173.94,193.109.173.93,193.109.173.92,193.109.173.91,193.109.173.90,193.109.173.89,193.109.173.88,193.109.173.87,193.109.173.85,193.109.173.84,193.109.173.83,193.109.173.82,193.109.173.81,193.109.173.80,193.109.173.79,193.109.173.59,193.109.173.58,193.109.173.57,193.109.173.55,193.109.173.54,193.109.173.53,193.109.173.52,193.109.173.51,193.109.173.50,193.109.173.49,193.109.173.48,193.109.173.47,193.109.173.46,193.109.173.45,193.109.173.44,193.109.173.43,193.109.173.42,193.109.173.25,193.109.173.24,193.109.173.23,193.109.173.144
Copy Command: /usr/bin/scp -rpB
job[4458005.pbs02.pic.es] state=RUNNING sidlist=15681
job[4457752.pbs02.pic.es] state=RUNNING sidlist=564
job[4457781.pbs02.pic.es] state=RUNNING sidlist=5523
job[4457606.pbs02.pic.es] state=RUNNING sidlist=27087
Assigned CPU Count: 4
diagnostics complete
seems ok...
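To compare nodes quickly, the timing fields of this output could be reduced to one line per node. This is a minimal sketch, assuming the `momctl -d 3` output has the layout shown above; mom_last_contact is a hypothetical helper name.

```shell
# Summarise the contact timings from `momctl -d 3` output read on stdin.
# mom_last_contact is a hypothetical helper, not part of TORQUE.
mom_last_contact() {
    awk '
        /^[ \t]*Host:/            { split($2, h, "/"); host = h[1] }
        /Last Msg From Server:/   { from = $5 }
        /Last Msg To Server:/     { to = $5 }
        /Server Update Interval:/ { interval = $4 }
        END {
            printf "%s from_server=%ss to_server=%ss update_interval=%ss\n",
                   host, from, to, interval
        }
    '
}

# Usage, on a compute node:
#   momctl -d 3 | mom_last_contact
```

For the output above this would give `td112.pic.es from_server=24s to_server=36s update_interval=45s`; a node whose last-message times are well beyond its update interval would be worth a closer look.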
I've searched the list for this error and checked all the previous
recommendations, so at this point I'm really, really lost.
Could anyone who has experienced this before give me a few clues?
TIA,
Arnau