[torqueusers] Problems with 2.0.0p2

Åke Sandgren ake.sandgren at hpc2n.umu.se
Tue Nov 29 02:13:08 MST 2005


Hi!

Since upgrading to 2.0.0p2 we have been having problems with server/mom
communication.

The server believes that nodes are down, and momctl -d 0 -h node gives
this kind of output:

Host: k114.hpc2n.umu.se/k114.hpc2n.umu.se   Version: 2.0.0p2
Server[0]: namnam.hpc2n.umu.se (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         50132 hellos
  Last Msg From Server:   81548 seconds (DeleteJob)
  Last Msg To Server:     0 seconds
Server[1]: namnam-k.hpc2n.umu.se (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         48818 hellos
  Last Msg From Server:   1386 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     0 seconds
HomeDirectory:          /var/spool/PBS/mom_priv
MOM active:             611217 seconds
LOGLEVEL:               6 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE


If I restart pbs_server, the SAME set of nodes will usually show up as
down again after a while. There is absolutely nothing wrong with these
nodes.

This shows up on both clusters where we are running 2.0.0p2.

Has anyone else seen this behaviour?

The nodes are located on the namnam-k interface, but the actual server
name is namnam (since that is taken from hostname), hence the dual
server config.
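
For anyone who wants to check many nodes for the same symptom, here is a
rough Python sketch that reads momctl diagnostic text like the output
above and lists the servers where the mom is still sending but has not
heard back for a while. It only parses the field labels visible in that
output. The function names (parse_servers, one_way) and the max_age
thresholds are made up for illustration, and the output layout of other
TORQUE versions may differ.

```python
import re

# Field labels copied from the momctl output quoted in this message.
SERVER_RE = re.compile(r"^Server\[(\d+)\]:\s+(\S+)")
FROM_RE = re.compile(r"Last Msg From Server:\s+(\d+) seconds(?:\s+\((\w+)\))?")
TO_RE = re.compile(r"Last Msg To Server:\s+(\d+) seconds")

# Excerpt of the output above, used as sample input.
SAMPLE = """\
Host: k114.hpc2n.umu.se/k114.hpc2n.umu.se   Version: 2.0.0p2
Server[0]: namnam.hpc2n.umu.se (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         50132 hellos
  Last Msg From Server:   81548 seconds (DeleteJob)
  Last Msg To Server:     0 seconds
Server[1]: namnam-k.hpc2n.umu.se (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         48818 hellos
  Last Msg From Server:   1386 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     0 seconds
HomeDirectory:          /var/spool/PBS/mom_priv
"""


def parse_servers(text):
    """Return one dict per 'Server[n]:' block found in the text."""
    servers = []
    current = None
    for line in text.splitlines():
        m = SERVER_RE.match(line)
        if m:
            current = {"index": int(m.group(1)), "name": m.group(2),
                       "from_age": None, "from_type": None,
                       "to_age": None, "warning": False}
            servers.append(current)
            continue
        if current is None:
            continue  # lines before the first Server[n] entry
        if "WARNING:" in line:
            current["warning"] = True
        m = FROM_RE.search(line)
        if m:
            current["from_age"] = int(m.group(1))
            current["from_type"] = m.group(2)
        m = TO_RE.search(line)
        if m:
            current["to_age"] = int(m.group(1))
    return servers


def one_way(servers, max_age=600):
    """Names of servers the mom talked to recently but has not heard from.

    max_age (seconds) is an arbitrary threshold chosen for this sketch.
    """
    return [s["name"] for s in servers
            if s["from_age"] is not None and s["to_age"] is not None
            and s["to_age"] <= max_age < s["from_age"]]


if __name__ == "__main__":
    for name in one_way(parse_servers(SAMPLE)):
        print("no recent message from", name)
```

With the sample above and the default threshold, both server entries are
reported. Raising max_age to 3600 leaves only namnam.hpc2n.umu.se.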

