[torquedev] [torqueusers] Question about what does PBS_NUM_NODES and PBS_NUM_PPN means
siegert at sfu.ca
Wed Dec 8 15:28:52 MST 2010
On Tue, Dec 07, 2010 at 10:06:23PM -0500, Glen Beane wrote:
> On Tue, Dec 7, 2010 at 4:55 PM, Martin Siegert <siegert at sfu.ca> wrote:
> > On Tue, Dec 07, 2010 at 01:26:20PM -0500, Glen Beane wrote:
> >> On Tue, Dec 7, 2010 at 1:18 PM, David Beer <dbeer at adaptivecomputing.com> wrote:
> >> >
> >> >> the customer isn't always right ;)
> >> >>
> >> >> really, I don't think we should pollute the codebase with hacks for
> >> >> specific customers when there may be a better more general way to do
> >> >> something that will have wider use
> >> >
> >> > I also wish that every time I had to solve a problem for a customer I had time to flesh the idea out with the community, discover the best, most widely applicable solution, and then code that. Unfortunately, that is rarely the case. I believe we've made strong efforts to get the community more involved - I know we still can improve in this - but situations will always arise that just need to be fixed. It's not ideal but it happens.
> >> maybe we could keep those types of changes in a branch, or maybe give
> >> that customer a patch to solve their immediate need while we work on a
> >> more robust solution to push into torque? I'm not saying things will
> >> be perfect, but adding lots and lots of quick-fixes to satisfy a very
> >> small number of sites makes the code more complicated and harder to
> >> maintain.
> > Frankly, I would not like that at all.
> > Two cases from the recent past:
> > 1) I submitted a patch that would implement an environment variable
> > PBS_NCPUS that would contain the number of processors assigned to
> > the job. It was rejected because of the vague possibility that
> > sometime in the future there may be support for dynamically sized
> > jobs, even though the patch was tiny and I couldn't care less if
> > PBS_NCPUS had to be redefined sometime in the vague future
> > to be "initial value of ...".
> > 2) I submitted a patch that would allow routing based on a node
> > specification -l nodes=x1:ppn=y1+x2:ppn=y2+... by calculating the
> > sum x1*y1+x2*y2+... That patch was rejected since this would be
> > fixed some time in the future anyway.
> > By now I learned that I should not have submitted the patches to
> > torque-dev, but to Moab support.
> TORQUE patches should be submitted to bugzilla. I'm not sure why you
> think they should be submitted to Moab support.
Sorry, this was probably inappropriate, but born out of frustration:
I had submitted the PBS_NCPUS patch to the torque-dev list. It was
rejected. Now I saw that PBS_NUM_NODES and PBS_NUM_PPN got implemented.
In nature these two environment variables are very similar to PBS_NCPUS,
in fact, they only work with a nodes specification that does not
contain "+", whereas PBS_NCPUS works with any nodes, procs specification
(it's just vnodenum). Thus, I concluded that if I had submitted the
patch to moab support (we do have a support contract) the patch may
have been accepted. Thus, for now I am left with keeping my own set
of patches around.
(in short: I agree with your statement in principle, but ...)
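
To illustrate the sum x1*y1+x2*y2+... from case 2) above, here is a minimal
Python sketch. It is not code from either patch: the function name
total_procs is hypothetical, and the parsing is a simplification that
assumes ppn defaults to 1 and that a non-numeric leading field (such as a
hostname) stands for a single node.

```python
def total_procs(nodes_spec):
    """Sum count*ppn over the '+'-separated chunks of a nodes specification.

    Hypothetical helper for illustration only; it handles the
    x1:ppn=y1+x2:ppn=y2 form and ignores any other properties.
    """
    total = 0
    for chunk in nodes_spec.split("+"):
        fields = chunk.split(":")
        # The leading field is a node count; anything non-numeric
        # (e.g. a hostname) is treated as one node (an assumption).
        count = int(fields[0]) if fields[0].isdigit() else 1
        ppn = 1  # assumed default when ppn is not given
        for field in fields[1:]:
            if field.startswith("ppn="):
                ppn = int(field[len("ppn="):])
        total += count * ppn
    return total


print(total_procs("2:ppn=4+3:ppn=8"))  # 2*4 + 3*8 = 32
```

A specification with "+" in it, like the one in the example, has no single
node count and ppn pair that describes it, whereas the sum is always well
defined.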
> > Where should that lead to? Everybody keeps their own little patches
> > around, Adaptive Computing keeps their patches and nothing gets
> > implemented in torque?
> no no no, I'm not arguing for that AT ALL. I'm just saying if it is
> a quick fix to satisfy the need for ONE customer then does it need to
> be checked into the mainline TORQUE, especially if there is a more
> general solution that might benefit many sites? I'm just arguing that
> we should at least discuss some of these publicly before they are
> implemented in TORQUE. If Adaptive has input from more than just the
> one customer that requested the original change then maybe we could end
> up with a better solution that many people might find useful.