[torqueusers] nodes switching back to state down

Schulz, Henrik H.Schulz at fz-rossendorf.de
Wed Jan 18 02:26:11 MST 2006


Actually, I am using only one server. The other two are configured in case I ever have to switch to another server, but I never do. I will remove the configuration for the other servers.

I did the following, which seems to help: I removed $clienthost from mom_priv/config and added $pbsserver. Now it works:

[root at hn253 schulzh]# /usr/torque/sbin/momctl -d 0 -h cn47

Host: cn47/cn47   Version: 2.0.0p4
Server[0]: hn253.fz-cluster.de (connection is active)
  Last Msg From Server:   38 seconds (StatusJob)
  Last Msg To Server:     45 seconds
HomeDirectory:          /var/spool/torque/mom_priv
MOM active:             1669 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Job[487.hn253]  State=RUNNING
Assigned CPU Count:     4

diagnostics complete

Thanks Garrick!

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Tuesday, January 17, 2006 23:42
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] nodes switching back to state down

> 3. While the node is down like this, momctl says it has a connection to the server:
> 
> Host: cn49/cn49   Version: 2.0.0p4
> Server[0]: hn254 (connection is active)
>   WARNING:  no hello/cluster-addrs messages received from server
>   Init Msgs Sent:         54149 hellos
>   Last Msg From Server:   237319 seconds (DeleteJob)
>   Last Msg To Server:     2 seconds
> Server[1]: hn253 (connection is active)
>   Last Msg From Server:   18505 seconds (CLUSTER_ADDRS)
>   Last Msg To Server:     2 seconds
> Server[2]: hn252 (connection is active)
>   WARNING:  no hello/cluster-addrs messages received from server
>   Init Msgs Sent:         54149 hellos
>   WARNING:  no messages received from server
>   Last Msg To Server:     2 seconds
> HomeDirectory:          /var/spool/torque/mom_priv
> MOM active:             593473 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> JobList:                NONE
> 
> diagnostics complete
 

Actually, the "connection is active" message is wrong.  It always says
that.

Why do you have 3 servers configured?

The MOM is not getting any messages from the server.  That is bad.  Your
MOM is sending HELLO messages to your 3 servers and the server isn't
getting back to MOM.
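
Since the "connection is active" text cannot be trusted, the WARNING lines and the "Last Msg From Server" age are the fields worth reading. A minimal sketch of a filter for pasted `momctl -d 0` output follows. The function name `stale_servers` and the 300-second threshold are assumptions chosen for illustration, not anything TORQUE defines, and the parsing relies only on the output layout shown in this thread.

```python
import re

def stale_servers(momctl_output, max_age=300):
    """Map server name -> reason for servers that look silent.

    max_age (seconds) is an arbitrary illustrative threshold.
    """
    problems = {}
    current = None
    for raw in momctl_output.splitlines():
        line = raw.strip()
        if line.startswith("HomeDirectory:"):
            # The per-server section of the report ends here.
            current = None
            continue
        m = re.match(r"Server\[\d+\]:\s+(\S+)", line)
        if m:
            current = m.group(1)
            continue
        if current is None:
            continue
        if line.startswith("WARNING:"):
            # Keep the first warning seen for each server.
            problems.setdefault(current, line[len("WARNING:"):].strip())
            continue
        m = re.match(r"Last Msg From Server:\s+(\d+) seconds", line)
        if m and int(m.group(1)) > max_age:
            problems.setdefault(
                current, f"last message from server {m.group(1)} seconds ago"
            )
    return problems
```

Fed the cn49 output quoted above, this would flag all three servers; a server whose last message is only a few seconds old and that has no WARNING lines is not flagged.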

Is there a problem with forward/reverse resolution with your nodes?
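
One way to test this from the server or from a node is to look the name up, look the resulting address up again, and compare. The sketch below uses only Python's standard `socket` module. The function name `check_resolution` is made up for this example, and accepting a match on the short host name (cn47 vs. cn47.fz-cluster.de) is an assumption about what counts as consistent; it may not match the comparison TORQUE itself performs.

```python
import socket

def check_resolution(hostname, forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward-then-reverse lookup of hostname.

    forward/reverse are injectable so the check can be exercised offline.
    """
    try:
        addr = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup of {hostname} failed: {exc}"
    try:
        canonical, aliases, _ = reverse(addr)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {addr} failed: {exc}"
    names = {n.lower() for n in (canonical, *aliases)}
    short_names = {n.split(".")[0] for n in names}
    wanted = hostname.lower()
    # Assumption: a match on the short host name is good enough.
    if wanted in names or wanted.split(".")[0] in short_names:
        return True, f"{hostname} -> {addr} -> {canonical}"
    return False, f"{hostname} -> {addr} -> {canonical} (name mismatch)"
```

Calling `check_resolution("cn47")` on the head node and `check_resolution("hn253")` on each compute node would cover both directions of the conversation between server and MOM.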

 
> 4. The pbsserver name is set in /var/spool/torque/server.name and
> there is the variable $clienthost set in
> /var/spool/torque/mom_priv/config. Is there another variable needed?

You want the server name in "/var/spool/torque/server_name".  If you
have 1 server, that's all you need.  If you have multiple servers, then
list all of them with $pbsserver.  $clienthost is deprecated.
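
To illustrate the change, here is a small sketch that scans the text of a mom_priv/config file, lists the $pbsserver entries, and warns about $clienthost. The helper `audit_mom_config` and the two sample configs are hypothetical; the host name is simply the one from this thread, and treating lines that start with "#" as comments is an assumption about the file format.

```python
def audit_mom_config(text):
    """Return (pbsservers, warnings) for the contents of mom_priv/config."""
    servers, warnings = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no settings.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        key, values = parts[0], parts[1:]
        if key == "$clienthost":
            warnings.append(
                f"line {lineno}: $clienthost is deprecated, use $pbsserver"
            )
        elif key == "$pbsserver":
            if values:
                servers.append(values[0])
            else:
                warnings.append(f"line {lineno}: $pbsserver has no host name")
    return servers, warnings

# Illustrative before/after contents of mom_priv/config.
OLD_CONFIG = "$clienthost hn253.fz-cluster.de\n"
NEW_CONFIG = "$pbsserver hn253.fz-cluster.de\n"
```

With these samples, `audit_mom_config(OLD_CONFIG)` returns no servers and one deprecation warning, while `audit_mom_config(NEW_CONFIG)` returns the single server and no warnings.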

 



