[torqueusers] Nodes show up as down

Prakash Velayutham prakash.velayutham at cchmc.org
Tue Jun 10 20:38:56 MDT 2008


Hello All,

Now I have another problem.

I have 2 Torque (2.4.0) servers running with the --ha option.

When the servers started up, I got:

06/09/2008 20:19:55;0002;PBS_Server;Svr;Log;Log opened
06/09/2008 20:19:55;0006;PBS_Server;Svr;PBS_Server;Server bmiclustersvc1.cchmc.org started, initialization type = 1
06/09/2008 20:19:55;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20080609 opened
06/09/2008 20:19:55;0040;PBS_Server;Req;setup_nodes;setup_nodes()
06/09/2008 20:19:55;0086;PBS_Server;Svr;PBS_Server;Recovered queue default
06/09/2008 20:19:55;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
06/09/2008 20:19:55;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
06/09/2008 20:19:55;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002
06/09/2008 20:19:55;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 27378, loglevel=0
06/09/2008 20:20:00;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
06/09/2008 20:20:00;0040;PBS_Server;Req;ping_nodes;successful ping to node bmiwebd1 (stream 2)

But after a while, the server started complaining:

06/09/2008 20:59:55;0004;PBS_Server;Svr;check_nodes;node bmiwebd1 not detected in 4410 seconds, marking node down
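
For anyone who wants to scan a server log for these events, here is a minimal sketch in Python. It assumes the semicolon-separated layout seen in the log lines above; the field names (timestamp, code, daemon, obj_class, obj_name, message) are my own labels, not Torque's, and the helper names are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not official Torque terms.
    timestamp: str
    code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_log_line(line: str) -> LogRecord:
    """Split one log line on its first five semicolons."""
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        raise ValueError("unexpected log line layout: %r" % line)
    return LogRecord(*parts)


# \s+ tolerates the doubled spaces that mail clients leave when re-wrapping.
DOWN_RE = re.compile(
    r"node\s+(\S+)\s+not\s+detected\s+in\s+(\d+)\s+seconds,\s+marking\s+node\s+down"
)


def node_down_event(record: LogRecord) -> Optional[Tuple[str, int]]:
    """Return (node, seconds) for a 'marking node down' message, else None."""
    match = DOWN_RE.search(record.message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

Feeding each line of the server log through parse_log_line() and node_down_event() gives a list of which nodes were marked down and when, which can then be compared against what pbsnodes reports.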

Still, "pbsnodes" is showing the node as free.

bmiclustersvc1:~ # pbsnodes -a
bmiwebd1
      state = free
      np = 1
      ntype = cluster
      status = opsys=linux,uname=Linux bmiwebd1 2.6.22.17-0.1-cluster #1 SMP Thu Jun 5 01:06:15 EDT 2008 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=55528,totmem=12054528kb,availmem=11978200kb,physmem=4054536kb,ncpus=4,loadave=0.00,netload=553762323,state=free,jobs=,varattr=,rectime=1213151796
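
One way to check how old that status report is: a rough Python sketch that splits the status string into key/value pairs and compares rectime with the current time. It assumes rectime is a Unix timestamp (seconds since the epoch) recorded when the status was received, which I have not verified against the Torque source; parse_status and report_age_seconds are hypothetical helper names.

```python
import time
from typing import Dict, Optional


def parse_status(status: str) -> Dict[str, str]:
    """Turn 'key=value,key=value,...' into a dict; empty values are kept."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that carry no '='
            fields[key.strip()] = value.strip()
    return fields


def report_age_seconds(fields: Dict[str, str], now: Optional[float] = None) -> float:
    """Seconds elapsed since rectime (assumed to be epoch seconds)."""
    if now is None:
        now = time.time()
    return now - int(fields["rectime"])


if __name__ == "__main__":
    sample = "opsys=linux,ncpus=4,loadave=0.00,state=free,jobs=,varattr=,rectime=1213151796"
    info = parse_status(sample)
    print(info["state"], report_age_seconds(info))
```

If the age keeps growing while the state stays "free", the displayed status is probably just the last report the server received rather than a current one.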

So, jobs do not get scheduled.

Any ideas why the node is showing up as free even though the server says
it cannot contact the node?

Thanks,
Prakash


On Jun 10, 2008, at 8:10 PM, Chris Samuel wrote:

>
> ----- "Prakash Velayutham" <prakash.velayutham at cchmc.org> wrote:
>
>> I had to compile Torque separately on the server and the MOM, even
>> though the underlying CPU arch is the same (opteron).
>
> That's really odd, we've never had to do that here!
>
> Perhaps there is some configuration difference that configure
> picks up on the nodes ?
>
> cheers!
> Chris
> -- 
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

Prakash Velayutham
Programmer / Analyst
Cincinnati Children's Hospital Medical Center


