[torqueusers] scalability of 2.1.2?

Donald Tripp dtripp at hawaii.edu
Tue Aug 8 18:00:34 MDT 2006


Could this be an Ethernet saturation problem? The clients may be
timing out.
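
One rough way to look for signs of saturation is to check the interface error and drop counters on the server and on a few affected nodes. The sketch below assumes Linux hosts and the usual /proc/net/dev layout (two header lines, then 16 counters per interface). The function names and the sample text are illustrative only and are not part of TORQUE.

```python
# Sketch: flag interfaces whose error, drop, or collision counters are nonzero.
# Assumes the standard Linux /proc/net/dev column order:
#   rx: bytes packets errs drop fifo frame compressed multicast
#   tx: bytes packets errs drop fifo colls carrier compressed

def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {interface: selected counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue  # unexpected layout, ignore this line
        stats[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
            "tx_colls": fields[13],
        }
    return stats


def suspicious(stats):
    """Return the sorted names of interfaces with any nonzero counter."""
    return sorted(name for name, c in stats.items() if any(c.values()))


# Made-up sample input; on a real node, read the text from /proc/net/dev.
SAMPLE = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
    "    lo: 100 10 0 0 0 0 0 0 100 10 0 0 0 0 0 0\n"
    "  eth0: 5000 50 3 7 0 0 0 0 4000 40 0 1 0 2 0 0\n"
)

print(suspicious(parse_net_dev(SAMPLE)))
```

These counters are cumulative since boot, so a nonzero value alone does not prove saturation. Comparing two readings taken while pbs_server is marking nodes down would be more telling.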

- Don

On Aug 8, 2006, at 8:57 AM, Thomas G. Raisor wrote:

> Hi all,
>
> I am trying to get torque 2.1.2 running on our cluster, but it  
> seems once I reach a certain number of compute-nodes, pbs_server  
> gives up and marks everything down.
>
> Our cluster is in 4 pieces, on 4 subnets: a, b, c, and d. The server runs  
> on a host on the b network. I add all the hosts in b and things  
> work great. I add a few compute nodes from the a network - things  
> work great. I add all the rest of the nodes for a, and everything  
> gets marked down - including all nodes on the b network, and they  
> never become free again.
>
> I have compiled using rpp and disable-rpp - same problem.
>
> momctl -d 3 reports no problems when things are working. When I  
> scale up past my b network, the output on added nodes gives the  
> warning: no hello/cluster-addrs messages received from server
>
> To this point I have only been able to successfully get up to about  
> 200 nodes of my 618 node cluster working. I have had it all working  
> on previous releases of torque and on RHEL 4U2 - I just upgraded to  
> RHEL 4U3 (platform rocks) and was updating torque at the same time  
> and so far have failed to get it working.
>
> Suggestions?
>
> Tom-
>
> pbs_server log contents:
>
> On startup:
> 12:29:36;0040;PBS_Server;Req;ping_nodes;successful ping to node m4a-4-1i (stream 347)
>
> A few seconds later:
> 08/08/2006 12:39:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to m4a-4-1i is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message).  setting node state to down
>
> AND later still:
> 08/08/2006 12:41:32;0004;PBS_Server;Svr;check_nodes;node m4a-4-1i not detected in 1155062492 seconds, marking node down
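
A note on the check_nodes line above: 1155062492 seconds is over 36 years, so the figure is unlikely to be a real elapsed time. Read as a Unix timestamp, it matches the clock time of the log entry itself. That would be consistent with the server subtracting a last-heard time of 0, meaning it never recorded a status update from that node, though this is an inference from the log and not something confirmed in the TORQUE source. A minimal check, assuming the log times are MDT (UTC-6) as in the list archive header:

```python
from datetime import datetime, timedelta, timezone

# Value copied from the check_nodes log line.
reported_seconds = 1155062492

# Assumption: the server logs local time in MDT (UTC-6).
mdt = timezone(timedelta(hours=-6))

# Interpret the "seconds" figure as a Unix timestamp instead of a duration.
as_clock_time = datetime.fromtimestamp(reported_seconds, tz=mdt)

print(as_clock_time.strftime("%m/%d/%Y %H:%M:%S"))  # prints 08/08/2006 12:41:32
```

The printed time is the same as the timestamp on the log line, 08/08/2006 12:41:32.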
>
>
> Mom logs:
> 08/08/2006 12:44:53;0002;   pbs_mom;Svr;Log;Log opened
> 08/08/2006 12:44:53;0002;   pbs_mom;Svr;usecp;*:/ibrix/home   /ibrix/home
> 08/08/2006 12:44:53;0002;   pbs_mom;Svr;usecp;*:/state/partition1/home       /home
> 08/08/2006 12:44:53;0002;   pbs_mom;n/a;initialize;independent
> 08/08/2006 12:44:53;0002;   pbs_mom;Svr;pbs_mom;Is up
> 08/08/2006 12:44:53;0002;   pbs_mom;Svr;mom_main;MOM executable path and mtime at launch: /usr/local/sbin/pbs_mom 1155061646
> 08/08/2006 12:44:53;0002;   pbs_mom;n/a;mom_main;hello sent to server m4bi
>
>
> -- 
> ~-~-~-~-~-
> Tom Raisor
> HPC Systems Administrator
> Brigham Young University
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


