[torqueusers] pbs_server and nodes file how to handle comments

Garrick Staples garrick at usc.edu
Wed Mar 29 16:24:24 MST 2006


On Wed, Mar 29, 2006 at 01:24:06PM +0100, David Golden alleged:
> On 2006-03-28 09:58:09 -0800, Garrick Staples wrote:
> > On Tue, Mar 28, 2006 at 01:46:44PM +0100, David Golden alleged:
> > > > That would be a frequency of 0.  New nodes start in state unknown, get
> > > > pinged, and get an addr list.  The old nodes never get the new addr list.
> > > 
> > > Ah.
> > > 
> > > Not that it's necessarily what you'd want to do (especially given your
> > > large-cluster avoiding-ping concerns and maybe iffy effect on running 
> > > jobs, though jobs on nodes I tested on weren't interrupted): 
> > > but if you "pbsnodes -r" on the old nodes to force them state=down, 
> > > do they then get the updated node list and do something useful with
> > > it when they're noticed to be "back" online by the server? 
> > 
> 
> > Yes, setting a node to down will trigger a ping operation and it will
> > get a new addr list.
> >
> > This is why a cluster-wide ping operation is needed to support creating
> > new nodes automatically.
> >
> 
> Well, point being that presumably one could therefore do a
> "pbsnodes -r node1 node2 node3 node4 ... nodeN" after adding
> the new nodes  -i.e. bring every node in the cluster to 
> state = down so they all get the new list? (you could do subsets
> at a time, too, for large clusters (especially if said cluster
> is split into nodesets, nodes in one set mightn't need to know
> about the nodes in another set immediately)) - maybe 
> sledgehammer-for-a-nail, though then again maybe not: e.g. you 
> mightn't want new parallel jobs issued to a  node until you're 
> sure it had the new node list. 

I'm trying to come up with the "simple and works in all cases" solution.  I
don't think messing with the individual node states really satisfies
that.
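
For reference, the batched reset David describes could be scripted roughly
like this. It is only a sketch: it builds the "pbsnodes -r" command lines
for subsets of the cluster and does not run anything. The function name,
the batch size and the node names are made up for illustration.

```python
from typing import Iterable, List


def batched_reset_commands(nodes: Iterable[str], batch_size: int = 64) -> List[List[str]]:
    """Build one `pbsnodes -r` argument list per batch of node names.

    Nothing is executed here; the caller decides whether and when to run
    each command (for example with subprocess.run).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    names = list(nodes)
    return [
        ["pbsnodes", "-r", *names[i:i + batch_size]]
        for i in range(0, len(names), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical node names, purely for illustration.
    cluster = [f"node{n}" for n in range(1, 8)]
    for argv in batched_reset_commands(cluster, batch_size=3):
        print(" ".join(argv))
```

Doing it in batches only limits how many nodes are down at once. It still
depends on touching individual node states, which is the part I'd like to
avoid.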

 
> There's also the not-being-able-to-do-everything-within-qmgr:
> but you could make node states settable within qmgr,
> then do much the same thing - i.e.

You can: 'set node XXXX state += down'.  But that is clunky,
error-prone, and doesn't actually prevent the race condition.

 
> create new nodes 
> set all nodes down
> clear new nodes offline 

My proposal basically does #2, but simpler:

create new nodes
'set server ping_nodes=T'
clear new nodes
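
Spelled out as commands, the three steps might look like the sketch below.
Big caveats: ping_nodes is only the server attribute proposed in this
message and may not exist in any given TORQUE build, and reading "clear new
nodes" as "pbsnodes -c" (clear OFFLINE) is an assumption. The function name
and node names are hypothetical, and nothing is executed.

```python
from typing import List


def add_nodes_commands(new_nodes: List[str]) -> List[List[str]]:
    """Build the command sequence for the proposed add-node procedure."""
    if not new_nodes:
        return []
    cmds: List[List[str]] = []
    # Step 1: create the new nodes through qmgr.
    for name in new_nodes:
        cmds.append(["qmgr", "-c", f"create node {name}"])
    # Step 2: ask the server to ping every node so that all of them
    # receive the new address list. ASSUMPTION: ping_nodes is the
    # attribute proposed in this thread, not a confirmed feature.
    cmds.append(["qmgr", "-c", "set server ping_nodes=T"])
    # Step 3: clear the new nodes. ASSUMPTION: this maps to
    # `pbsnodes -c`, which clears the OFFLINE state.
    cmds.append(["pbsnodes", "-c", *new_nodes])
    return cmds


if __name__ == "__main__":
    for argv in add_nodes_commands(["node101", "node102"]):
        print(argv)
```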

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California