[torqueusers] Nodes show up as down
Prakash Velayutham
prakash.velayutham at cchmc.org
Mon Jun 9 17:56:47 MDT 2008
Hi All,
Just wanted to update the list that this issue is now resolved.
I had to compile Torque separately on the server and on the MOM host,
even though both machines have the same CPU architecture (Opteron).
Thanks,
Prakash
On Jun 9, 2008, at 4:20 PM, Prakash Velayutham wrote:
> Hi,
>
> I am having trouble with the Torque server and MOM in a test
> setup. I have tested 2.3.0 and 2.4.0 and both show the same
> issue, so I think it is a configuration problem rather than
> a bug.
>
> The only node shows up as down in "pbsnodes" for some strange reason.
>
> This is what I notice:
>
> MOM side:
>
> bmiwebd1:~ # /usr/local/torque-2.4.0/sbin/momctl -d 3
>
> Host: bmiwebd1/bmiwebd1.cluster.cchmc.org Version: 2.4.0-snap.200804241119 PID: 10566
> Server[0]: bmiclustersvc1.cchmc.org (205.142.199.238:15001)
> WARNING: no hello/cluster-addrs messages received from server
> Init Msgs Sent: 1 hellos
> WARNING: no messages received from server
> Last Msg To Server: 16 seconds
> HomeDirectory: /var/spool/torque/mom_priv
> stdout/stderr spool directory: '/var/spool/torque/spool/' (9789490 blocks available)
> NOTE: syslog enabled
> HomeDirectory: /var/spool/torque/mom_priv
> MOM active: 16 seconds
> Server Update Interval: 45 seconds
> LogLevel: 9 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: RPP
> MemLocked: TRUE (mlock)
> Prolog: /var/spool/torque/mom_priv/prologue (disabled)
> Alarm Time: 0 of 10 seconds
> Trusted Client List: 205.142.199.238,192.168.1.14,127.0.0.1
> Copy Command: /usr/bin/scp -rpB
> NOTE: no local jobs detected
>
> diagnostics complete
>
>
> Server logs:
>
> 06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Server bmiclustersvc1.cchmc.org started, initialization type = 1
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20080609 opened
> 06/09/2008 16:13:55;0040;PBS_Server;Req;setup_nodes;setup_nodes()
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 06/09/2008 16:13:55;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002
> 06/09/2008 16:13:55;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 6734, loglevel=0
> 06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
> 06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;successful ping to node bmiwebd1 (stream 1)
> 06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to bmiwebd1 is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message). setting node state to down
>
> Thanks,
> Prakash
>
> Prakash Velayutham
> Programmer / Analyst
> Cincinnati Children's Hospital Medical Center
Prakash Velayutham
Programmer / Analyst
Cincinnati Children's Hospital Medical Center
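
P.S. For anyone hitting the same symptom, here is a rough Python sketch
for pulling the "setting node state to down" events out of a server log
like the one quoted above. It assumes each record is one line with six
semicolon-separated fields, as in that excerpt. The field names
(timestamp, event_code, and so on) are my own labels, not official
Torque terminology, and the sample lines are copied from the log above.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative labels for the six columns seen in the log.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any ';' inside the free-text message intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*(p.strip() for p in parts))


def down_events(lines: Iterable[str]) -> List[LogRecord]:
    """Return the records in which the server marks a node as down."""
    events = []
    for line in lines:
        rec = parse_record(line)
        if rec is not None and "setting node state to down" in rec.message:
            events.append(rec)
    return events


# Two records taken from the server log quoted in this thread.
SAMPLE = [
    "06/09/2008 16:14:00;0040;PBS_Server;Req;ping_nodes;"
    "successful ping to node bmiwebd1 (stream 1)",
    "06/09/2008 16:14:06;0001;PBS_Server;Svr;PBS_Server;"
    "stream_eof, connection to bmiwebd1 is bad, remote service may be down, "
    "message may be corrupt, or connection may have been dropped remotely "
    "(Premature end of message). setting node state to down",
]

if __name__ == "__main__":
    for rec in down_events(SAMPLE):
        print(rec.timestamp, rec.message)
```

To run it against a real log, pass an open file handle to down_events()
in place of SAMPLE. It only reports the symptom; in my case the cause
was the shared build, which the log alone did not reveal.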