[torqueusers] pbsnodes says nodes down, PBS_Server connection to node is bad

Ken Nielson knielson at adaptivecomputing.com
Mon Aug 1 08:54:35 MDT 2011


Steven,

Which version of TORQUE are you running?

Ken Nielson
Adaptive Computing

----- Original Message -----
> From: "StevenX A DuChene" <stevenx.a.duchene at intel.com>
> To: torqueusers at supercluster.org
> Sent: Friday, July 29, 2011 3:30:08 PM
> Subject: [torqueusers] pbsnodes says nodes down, PBS_Server connection to node is bad
> I am seeing the following in my logs:
> 
> Jul 29 14:21:20 extern01 PBS_Server: LOG_ERROR::stream_eof, connection
> to node126 is bad, remote service may be down, message may be corrupt,
> or connection may have been dropped remotely (End of File). setting
> node state to down
> Jul 29 14:21:20 extern01 PBS_Server: LOG_ERROR::stream_eof, connection
> to node243 is bad, remote service may be down, message may be corrupt,
> or connection may have been dropped remotely (End of File). setting
> node state to down
> Jul 29 14:21:20 extern01 PBS_Server: LOG_ERROR::stream_eof, connection
> to node140 is bad, remote service may be down, message may be corrupt,
> or connection may have been dropped remotely (End of File). setting
> node state to down
> Jul 29 14:21:21 extern01 PBS_Server: LOG_ERROR::stream_eof, connection
> to node085 is bad, remote service may be down, message may be corrupt,
> or connection may have been dropped remotely (End of File). setting
> node state to down
> 
> I see this for all of my nodes. In addition, pbsnodes reports the state
> of all of the nodes as down.
> 
> However, when I run the following:
> 
> # momctl -d 3 -h atomnode085
> 
> Host: node085.enour.at/node085.enour.at Version: 2.5.7 PID: 24400
> Server[0]: master1 (192.168.121.4:15001)
> Init Msgs Received: 1 hellos/0 cluster-addrs
> Init Msgs Sent: 1205 hellos
> Last Msg From Server: 5925 seconds (HELLO)
> Last Msg To Server: 76 seconds
> HomeDirectory: /var/spool/torque/mom_priv
> stdout/stderr spool directory: '/var/spool/torque/spool/' (3065857
> blocks available)
> NOTE: syslog enabled
> MOM active: 9680 seconds
> Check Poll Time: 45 seconds
> Server Update Interval: 45 seconds
> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: RPP
> MemLocked: TRUE (mlock)
> TCP Timeout: 20 seconds
> Prolog: /var/spool/torque/mom_priv/prologue (disabled)
> Alarm Time: 0 of 10 seconds
> Trusted Client List: 192.168.121.4,192.168.121.100,127.0.0.1
> Copy Command: /usr/bin/scp -rpB
> NOTE: no local jobs detected
> 
> diagnostics complete
> 
> All of the pbs_mom processes are running on the nodes, and their config
> file in /var/spool/torque/mom_priv points to the correct server name.
> 
> Is this in any way related to the Unauthorized Request problem I
> posted about a little while ago?
> --
> Steven DuChene
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
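
When every node shows the same stream_eof error, it can help to summarize
which nodes the server marked down and how often. The following is only a
minimal sketch in Python, assuming the syslog entries look like the ones
quoted above. The function name nodes_marked_down and the SAMPLE text are
illustrative and are not part of TORQUE.

```python
import re
from collections import Counter

# Pattern for the PBS_Server stream_eof entry as it appears in the quoted
# syslog excerpt. \s+ lets the match span the line wraps added by the mailer.
STREAM_EOF = re.compile(
    r"LOG_ERROR::stream_eof,\s+connection\s+to\s+(\S+)\s+is\s+bad"
)


def nodes_marked_down(log_text):
    """Count stream_eof errors per node name (hypothetical helper)."""
    # Drop any leading "> " quote markers, then join the lines so that a
    # wrapped entry becomes one searchable string.
    lines = [re.sub(r"^>\s?", "", line) for line in log_text.splitlines()]
    return Counter(STREAM_EOF.findall(" ".join(lines)))


# Two entries copied from the log excerpt in the original message.
SAMPLE = (
    "Jul 29 14:21:20 extern01 PBS_Server: LOG_ERROR::stream_eof, connection\n"
    "to node126 is bad, remote service may be down, message may be corrupt,\n"
    "or connection may have been dropped remotely (End of File). setting\n"
    "node state to down\n"
    "Jul 29 14:21:20 extern01 PBS_Server: LOG_ERROR::stream_eof, connection\n"
    "to node243 is bad, remote service may be down, message may be corrupt,\n"
    "or connection may have been dropped remotely (End of File). setting\n"
    "node state to down\n"
)

if __name__ == "__main__":
    for node, count in sorted(nodes_marked_down(SAMPLE).items()):
        print(node, count)
```

Comparing that list against the nodes that still answer momctl, as node085
does above, may help narrow down whether the failure is on the server side
or on the MOM side.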

