[torqueusers] pbs_server and nodes file how to handle comments
garrick at usc.edu
Wed Mar 29 16:24:24 MST 2006
On Wed, Mar 29, 2006 at 01:24:06PM +0100, David Golden alleged:
> On 2006-03-28 09:58:09 -0800, Garrick Staples wrote:
> > On Tue, Mar 28, 2006 at 01:46:44PM +0100, David Golden alleged:
> > > > That would be a frequency of 0. New nodes start in state unknown, get
> > > > pinged, and get an addr list. The old nodes never get the new addr list.
> > >
> > > Ah.
> > >
> > > Not that it's necessarily what you'd want to do (especially given your
> > > large-cluster avoiding-ping concerns and maybe iffy effect on running
> > > jobs, though jobs on nodes I tested on weren't interrupted):
> > > but if you "pbsnodes -r" on the old nodes to force them state=down,
> > > do they then get the updated node list and do something useful with
> > > it when they're noticed to be "back" online by the server?
> > Yes, setting a node to down will trigger a ping operation and it will
> > get a new addr list.
> > This is why a cluster-wide ping operation is needed to support creating
> > new nodes automatically.
> Well, point being that presumably one could therefore do a
> "pbsnodes -r node1 node2 node3 node4 ... nodeN" after adding
> the new nodes, i.e. bring every node in the cluster to
> state = down so they all get the new list? (you could do subsets
> at a time, too, for large clusters (especially if said cluster
> is split into nodesets, nodes in one set mightn't need to know
> about the nodes in another set immediately)) - maybe
> sledgehammer-for-a-nail, though then again maybe not: e.g. you
> mightn't want new parallel jobs issued to a node until you're
> sure it had the new node list.
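For reference, the batched "pbsnodes -r" idea quoted above could be scripted roughly like this. It is only a sketch: the node names and the batch size are made up for illustration, and the function just builds the command lines without running anything.

```python
from typing import Iterable, List


def batched_reset_commands(nodes: Iterable[str], batch_size: int) -> List[List[str]]:
    """Build one `pbsnodes -r` argv per batch of nodes (nothing is executed)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    names = list(nodes)
    return [
        ["pbsnodes", "-r", *names[i:i + batch_size]]
        for i in range(0, len(names), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical node names and batch size, for illustration only.
    for argv in batched_reset_commands([f"node{n}" for n in range(1, 6)], 2):
        print(" ".join(argv))
```

Each resulting list could be handed to subprocess.run() one batch at a time, with whatever pause between batches suits the cluster.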
I'm trying to come up with the "simple and works in all cases" solution. I
don't think messing with the individual node states really satisfies that.
> There's also the not-being-able-to-do-everything-within-qmgr:
> but you could make node states settable within qmgr,
> then do much the same thing - i.e.
You can, 'set node XXXX state += down', but that is clunky,
error-prone and doesn't actually prevent the race condition.
> create new nodes
> set all nodes down
> clear new nodes offline
My proposal basically does #2, but simpler:
create new nodes
'set server ping_nodes=T'
clear new nodes
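
The three steps above might look like this as a command sequence. This is a sketch under assumptions: ping_nodes is only the attribute proposed in this thread, not something pbs_server is known to support; "clear new nodes" is assumed to mean clearing OFFLINE with "pbsnodes -c"; and the node names are hypothetical. The function only builds the command strings.

```python
from typing import List


def add_nodes_commands(new_nodes: List[str]) -> List[str]:
    """Return the proposed command sequence as strings (nothing is executed)."""
    # Step 1: create the new nodes through qmgr.
    cmds = [f'qmgr -c "create node {name}"' for name in new_nodes]
    # Step 2: ask the server for one cluster-wide ping.
    # ASSUMPTION: ping_nodes is the attribute proposed in this thread only.
    cmds.append('qmgr -c "set server ping_nodes=T"')
    # Step 3: clear the new nodes.
    # ASSUMPTION: "clear" is read here as clearing OFFLINE via pbsnodes -c.
    cmds += [f"pbsnodes -c {name}" for name in new_nodes]
    return cmds


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for cmd in add_nodes_commands(["node101", "node102"]):
        print(cmd)
```

The point of the ordering is that the cluster-wide ping happens before the new nodes are cleared, so every node should have the new addr list before jobs can land on the new nodes.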
Garrick Staples, Linux/HPCC Administrator
University of Southern California