[torqueusers] Nodes show up as down
Prakash Velayutham
prakash.velayutham at cchmc.org
Tue Jun 10 20:38:56 MDT 2008
Hello All,
Now I have another problem.
I have two Torque (2.4.0) servers running with the --ha option.
When the servers started up, I got:
06/09/2008 20:19:55;0002;PBS_Server;Svr;Log;Log opened
06/09/2008 20:19:55;0006;PBS_Server;Svr;PBS_Server;Server bmiclustersvc1.cchmc.org started, initialization type = 1
06/09/2008 20:19:55;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20080609 opened
06/09/2008 20:19:55;0040;PBS_Server;Req;setup_nodes;setup_nodes()
06/09/2008 20:19:55;0086;PBS_Server;Svr;PBS_Server;Recovered queue default
06/09/2008 20:19:55;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
06/09/2008 20:19:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
06/09/2008 20:19:55;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002
06/09/2008 20:19:55;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 27378, loglevel=0
06/09/2008 20:20:00;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
06/09/2008 20:20:00;0040;PBS_Server;Req;ping_nodes;successful ping to node bmiwebd1 (stream 2)
But after a while, the server started complaining as below:
06/09/2008 20:59:55;0004;PBS_Server;Svr;check_nodes;node bmiwebd1 not detected in 4410 seconds, marking node down
Still, "pbsnodes" is showing the node as free.
bmiclustersvc1:~ # pbsnodes -a
bmiwebd1
     state = free
     np = 1
     ntype = cluster
     status = opsys=linux,uname=Linux bmiwebd1 2.6.22.17-0.1-cluster #1 SMP Thu Jun 5 01:06:15 EDT 2008 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=55528,totmem=12054528kb,availmem=11978200kb,physmem=4054536kb,ncpus=4,loadave=0.00,netload=553762323,state=free,jobs=,varattr=,rectime=1213151796
So, jobs do not get scheduled.
Any ideas why the node is showing up as free even though the server
says it cannot contact the node?
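
One way to narrow this down is to check how old the status record shown by pbsnodes actually is. The following is only a minimal sketch in Python, not part of Torque: parse_status and status_age_seconds are hypothetical helper names, and it assumes that rectime in the status line is a Unix epoch timestamp for the last status update the server recorded for the node.

```python
from datetime import datetime, timezone


def parse_status(status: str) -> dict:
    """Split a pbsnodes 'status' value into a key -> value dict."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore fragments that carry no '='
            fields[key.strip()] = value.strip()
    return fields


def status_age_seconds(fields: dict, now: datetime) -> float:
    """Seconds between 'now' and the rectime field.

    Assumption: rectime is seconds since the Unix epoch (UTC).
    """
    rectime = datetime.fromtimestamp(int(fields["rectime"]), tz=timezone.utc)
    return (now - rectime).total_seconds()


# Status line copied from the pbsnodes output above.
status = (
    "opsys=linux,uname=Linux bmiwebd1 2.6.22.17-0.1-cluster #1 SMP "
    "Thu Jun 5 01:06:15 EDT 2008 x86_64,sessions=? 0,nsessions=? 0,"
    "nusers=0,idletime=55528,totmem=12054528kb,availmem=11978200kb,"
    "physmem=4054536kb,ncpus=4,loadave=0.00,netload=553762323,"
    "state=free,jobs=,varattr=,rectime=1213151796"
)

fields = parse_status(status)
age = status_age_seconds(fields, datetime.now(timezone.utc))
print(f"node-reported state: {fields['state']}, status age: {age:.0f} s")
```

If the age printed right after running pbsnodes is small, the server is still receiving status updates from the MOM, which would suggest looking at the server's own node check rather than at the network path to the node.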
Thanks,
Prakash
On Jun 10, 2008, at 8:10 PM, Chris Samuel wrote:
>
> ----- "Prakash Velayutham" <prakash.velayutham at cchmc.org> wrote:
>
>> I had to compile Torque separately on the server and the MOM, even
>> though the underlying CPU arch is the same (opteron).
>
> That's really odd, we've never had to do that here!
>
> Perhaps there is some configuration difference that configure
> picks up on the nodes ?
>
> cheers!
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
Prakash Velayutham
Programmer / Analyst
Cincinnati Children's Hospital Medical Center