[torqueusers] nodes down and up because of lost connection

Arnau Bria arnaubria at pic.es
Wed May 27 05:19:07 MDT 2009


Hi all,

I've added a few new nodes to my cluster, and now I see many messages like:
May 27 12:50:58 pbs02 PBS_Server: stream_eof, connection to td071.pic.es is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (End of File).  setting node state to down

coming from ALL hosts.
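
One way to confirm that the messages really are spread over all hosts, rather than concentrated on a few, is to tally them per node. The following is only a minimal Python sketch: it assumes the syslog lines have exactly the format quoted above, and the function name `count_stream_eof` is made up for illustration.

```python
import re
from collections import Counter

# Captures the node name from pbs_server "stream_eof" lines in the format quoted above.
STREAM_EOF = re.compile(r"PBS_Server: stream_eof, connection to (\S+) is bad")


def count_stream_eof(lines):
    """Return a Counter mapping node name -> number of stream_eof messages."""
    counts = Counter()
    for line in lines:
        match = STREAM_EOF.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the server's syslog file (the path depends on the local syslog setup) and printing `counts.most_common()` would show whether a few nodes dominate.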

And nodes are marked as down at random:

[root@pbs02 ~]# pbsnodes -l
[root@pbs02 ~]# pbsnodes -l
td163.pic.es         down
[root@pbs02 ~]# pbsnodes -l
td051.pic.es         down
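
To see which nodes flap and how often, successive `pbsnodes -l` outputs can be compared. This is a rough Python sketch under the assumption that each output line is a node name followed by a comma-separated state list, as in the transcript above; `parse_down_nodes` and `count_down_transitions` are hypothetical helper names, and collecting the snapshots (for example, running the command once a minute) is left out.

```python
from collections import Counter


def parse_down_nodes(pbsnodes_output):
    """Return the set of node names that one `pbsnodes -l` output lists as down."""
    down = set()
    for line in pbsnodes_output.splitlines():
        fields = line.split()
        # Assumed line shape: "<node name> <state>[,<state>...]"
        if len(fields) >= 2 and "down" in fields[1].split(","):
            down.add(fields[0])
    return down


def count_down_transitions(snapshots):
    """Count, per node, how often it newly appears as down across successive snapshots."""
    transitions = Counter()
    previous = set()
    for output in snapshots:
        current = parse_down_nodes(output)
        for node in current - previous:
            transitions[node] += 1
        previous = current
    return transitions
```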


If I remove the new nodes, the messages still appear, so the problem is
not specific to the new nodes.

We have 190 nodes:

# wc -l /var/spool/pbs/server_priv/nodes
190 /var/spool/pbs/server_priv/nodes


I've asked the network people whether there is any network problem, and
they say no, so I don't know where this problem could come from.

I've restarted pbs_server and pbs_moms.

And if I go to one of the nodes and run momctl:

# momctl -d 3

Host: td112.pic.es/td112.pic.es   Version: 2.3.0-snap.200801151629   PID: 24696
Server[0]: pbs02.pic.es (193.109.174.37)
  Init Msgs Received:     0 hellos/1 cluster-addrs
  Init Msgs Sent:         1 hellos
  Last Msg From Server:   24 seconds (StatusJob)
  Last Msg To Server:     36 seconds
HomeDirectory:          /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (318554 blocks available)
NOTE:  syslog enabled
MOM active:             39 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
TCP Timeout:            20 seconds
Prolog:                 /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:    193.109.173.6,193.109.173.26,193.109.173.27,193.109.173.56,193.109.173.66,193.109.173.69,193.109.173.68,193.109.173.67,193.109.173.65,193.109.173.64,193.109.173.63,193.109.173.62,193.109.173.61,193.109.173.60,193.109.173.111,193.109.173.110,193.109.173.109,193.109.173.108,193.109.173.107,193.109.173.106,193.109.173.105,193.109.173.104,193.109.173.103,193.109.173.102,193.109.173.101,193.109.173.100,193.109.173.99,193.109.173.98,193.109.173.97,193.109.173.96,193.109.173.95,193.109.173.94,193.109.173.93,193.109.173.92,193.109.173.91,193.109.173.90,193.109.173.89,193.109.173.88,193.109.173.87,193.109.173.85,193.109.173.84,193.109.173.83,193.109.173.82,193.109.173.81,193.109.173.80,193.109.173.79,193.109.173.59,193.109.173.58,193.109.173.57,193.109.173.55,193.109.173.54,193.109.173.53,193.109.173.52,193.109.173.51,193.109.173.50,193.109.173.49,193.109.173.48,193.109.173.47,193.109.173.46,193.109.173.45,193.109.173.44,193.109.173.43,193.109.173.42,193.109.173.25,193.109.173.24,193.109.173.23,193.109.173.144
Copy Command:           /usr/bin/scp -rpB
job[4458005.pbs02.pic.es]  state=RUNNING  sidlist=15681
job[4457752.pbs02.pic.es]  state=RUNNING  sidlist=564
job[4457781.pbs02.pic.es]  state=RUNNING  sidlist=5523
job[4457606.pbs02.pic.es]  state=RUNNING  sidlist=27087
Assigned CPU Count:     4

diagnostics complete


seems ok...
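
The "seems ok" judgment above can be made a little more concrete by comparing the age of the last message sent to the server with the server update interval. The sketch below is only a heuristic: it assumes the `momctl -d 3` field labels shown above, and it assumes that a healthy MOM should have contacted the server within one update interval. The helper names are invented for illustration.

```python
import re


def parse_seconds(momctl_output, label):
    """Return the integer seconds reported after `label:`, or None if the field is absent."""
    pattern = rf"^\s*{re.escape(label)}:\s+(\d+) seconds"
    match = re.search(pattern, momctl_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def mom_reporting_on_time(momctl_output):
    """Heuristic (an assumption, not Torque's own check): last message to the
    server should be no older than the server update interval."""
    to_server = parse_seconds(momctl_output, "Last Msg To Server")
    interval = parse_seconds(momctl_output, "Server Update Interval")
    if to_server is None or interval is None:
        return None  # fields not found, nothing can be concluded
    return to_server <= interval
```

For the output above (36 seconds since the last message, 45-second interval) the check passes, which is consistent with the MOM side looking healthy.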



I've searched the list for this error and checked all the previous
recommendations, so at this point I'm really lost.


Could anyone who has experienced this before give me a few clues?

TIA,
Arnau

