[torqueusers] Nodes show up as down

Prakash Velayutham prakash.velayutham at cchmc.org
Mon Jun 9 17:56:47 MDT 2008


Hi All,

Just wanted to update the list that this issue is now resolved.

I had to compile Torque separately on the server host and on the MOM  
host, even though both machines have the same CPU architecture (Opteron).

Thanks,
Prakash

On Jun 9, 2008, at 4:20 PM, Prakash Velayutham wrote:

> Hi,
>
> I am having trouble getting the Torque server and MOM to work  
> together in a test setup.  I have tested with 2.3.0 and 2.4.0 and  
> both show the same issue, so I think it is a configuration problem  
> rather than a bug.
>
> The only node in the setup shows up as down in "pbsnodes".
>
> This is what I notice:
>
> MOM side:
>
> bmiwebd1:~ # /usr/local/torque-2.4.0/sbin/momctl -d 3
>
> Host: bmiwebd1/bmiwebd1.cluster.cchmc.org   Version: 2.4.0-snap.200804241119   PID: 10566
> Server[0]: bmiclustersvc1.cchmc.org (205.142.199.238:15001)
>  WARNING:  no hello/cluster-addrs messages received from server
>  Init Msgs Sent:         1 hellos
>  WARNING:  no messages received from server
>  Last Msg To Server:     16 seconds
> HomeDirectory:          /var/spool/torque/mom_priv
> stdout/stderr spool directory: '/var/spool/torque/spool/' (9789490 blocks available)
> NOTE:  syslog enabled
> HomeDirectory:          /var/spool/torque/mom_priv
> MOM active:             16 seconds
> Server Update Interval: 45 seconds
> LogLevel:               9 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    RPP
> MemLocked:              TRUE  (mlock)
> Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
> Alarm Time:             0 of 10 seconds
> Trusted Client List:    205.142.199.238,192.168.1.14,127.0.0.1
> Copy Command:           /usr/bin/scp -rpB
> NOTE:  no local jobs detected
>
> diagnostics complete
>
>
> Server logs:
>
> 06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Server bmiclustersvc1.cchmc.org started, initialization type = 1
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20080609 opened
> 06/09/2008 16:13:55;0040;PBS_Server;Req;setup_nodes;setup_nodes()
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 6734, loglevel=0
> 06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
> 06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;successful ping to node bmiwebd1 (stream 1)
> 06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to bmiwebd1 is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message).  setting node state to down
>
> Thanks,
> Prakash
>
> Prakash Velayutham
> Programmer / Analyst
> Cincinnati Children's Hospital Medical Center

Prakash Velayutham
Programmer / Analyst
Cincinnati Children's Hospital Medical Center
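
For anyone triaging a similar server log, here is a minimal Python sketch that scans records like the ones quoted above and reports which nodes the server marked down. It assumes each record is a single line with six semicolon-separated fields, as in the quoted log. The field labels and the function names (parse_record, nodes_marked_down) are my own, chosen for illustration, and are not Torque terminology or part of any Torque tool.

```python
import re

# Two records copied from the server log quoted above (unwrapped to one line each).
SAMPLE = [
    "06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;successful ping to node bmiwebd1 (stream 1)",
    "06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to bmiwebd1 is bad, "
    "remote service may be down, message may be corrupt, or connection may have been dropped remotely "
    "(Premature end of message).  setting node state to down",
]

# Hypothetical labels for the six semicolon-separated fields (assumption, not official names).
FIELDS = ("timestamp", "code", "daemon", "kind", "name", "message")


def parse_record(line):
    """Split one log line into a dict, or return None if it does not have six fields."""
    parts = line.strip().split(";", 5)  # maxsplit keeps any ';' inside the message intact
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def nodes_marked_down(lines):
    """Return (timestamp, node) pairs for records where the server set a node to down."""
    hits = []
    for line in lines:
        rec = parse_record(line)
        if rec is None or "setting node state to down" not in rec["message"]:
            continue
        match = re.search(r"connection to (\S+) is bad", rec["message"])
        if match:
            hits.append((rec["timestamp"], match.group(1)))
    return hits


if __name__ == "__main__":
    for timestamp, node in nodes_marked_down(SAMPLE):
        print(f"{timestamp}: {node} set to down")
```

To run it against a real log, pass the open file object to nodes_marked_down() in place of SAMPLE. The sketch only matches the stream_eof wording seen in this thread, so other reasons for a node going down would need their own patterns.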


