[torquedev] Re: binary change to .JB files in 2.3-fixes branch!

Josh Butikofer josh at clusterresources.com
Mon Mar 30 12:00:34 MDT 2009


Glen and everyone else who's interested:

First of all, let me explain what is going on with TORQUE 2.3. We have been
developing a lot of enhancements and bug fixes in a separate 2.3 branch. Many of
these fixes and enhancements we considered either very important or very
beneficial. Because these changes have been used and tested thoroughly in
several environments, we felt that a good number of them could be put into
TORQUE 2.3.7. We started this migration process last week. We were planning on
announcing all of these changes on the mailing list and asking developers/users
to review/test them. This is akin to peer review and beta testing--and it looks
like you've already started the peer review part. :-)

We haven't released TORQUE 2.3.7 yet because we know that any change in the
code increases the chance of bugs. We weren't planning on releasing 2.3.7
without more testing and review from the community.

Some changes that went into 2.3.7 last week during our huge merge operation
shouldn't have. Some are because of oversights on our part. Some are simply
accidental. For example, this HOSTNAME change was an oversight. Another change
that is too significant for 2.3.7 is the change to sockets. This will be removed
as well, but will be present in TORQUE 2.4. There may be others, and we will
need to look at all of them again.

Let me know if I'm wrong, but it seems the core of what you are suggesting is this:

* TORQUE 2.3.x should not include anything but minor bug fixes.
* All new features, enhancements, and more intrusive bug fixes should go into a
non-stable branch, which is now called trunk.

Up to this point, CRI's developers have been operating under a slightly
different model:

* TORQUE 2.3.x can include bug fixes, enhancements and features that can be
easily turned off or on (do not affect default behavior). We want to keep the
branch stable, but also feel it is important to continue to make the product
better for users, without having to wait months and months for the next major
release to come out.

* "Trunk" (or TORQUE 2.4) can have pretty much any big or small change put into
it that is deemed unfit for TORQUE 2.3.x. I've always felt that this is ...
dangerous and unwieldy.

It is obvious that there are differing opinions on how TORQUE's development
should be handled. So it seems to me that we need to come up with some specific
guidelines and hold everyone to them to avoid such situations in the future. But
I think it is important that all the stakeholders get a say in those guidelines.
We should also, perhaps, send a posting to the TORQUE development list before
doing any check-ins or merges to double-check that our change is a worthy one,
without any risks or concerns that we haven't thought of. What do you think?

As you know, CRI has customers who depend on our ability to deliver important
features and bug fixes to users in stable branches of TORQUE--in a timely
manner. I also see that other customers or users just want TORQUE to remain
stable and "not fixed, unless it's broken." We need to find a way to satisfy both
scenarios.

Glen Beane wrote:
> the change
> 
> #define PBS_MAXHOSTNAME  64 /* max host name length */
> 
> to
> 
> #define PBS_MAXHOSTNAME  1024 /* max host name length */
> 
> results in a change in the size of the ji_qs struct, which is what is
> saved in the .JB file.  This requires adding support for this upgrade
> to job_qs_upgrade so existing .JB files get upgraded to the new struct
> layout after a TORQUE upgrade, and it would be impossible to downgrade
> to a previous 2.3 release without draining the system of running jobs.
>
> this is how the size of ji_jobid in the ji_qs struct is defined:
> 
> #define PBS_MAXSERVERNAME PBS_MAXHOSTNAME /* max server name length */
> #define PBS_MAXSVRJOBID  (PBS_MAXSEQNUM + PBS_MAXSERVERNAME +
> PBS_MAXPORTNUM + PBS_MAXJOBARRAYLEN + 2 ) /* server job id size */
> 
> This change _needs_ to be pulled out of 2.3-fixes. We should not be
> making changes to this structure in "bug fix" releases.  I am going to
> change this back to 64 in 2.3-fixes, and leave it as 1024 in trunk.

I agree--it was a mistake to put this into TORQUE 2.3, because it changes the
job structure. It is proper to leave this in trunk, since some users have
hostnames that are longer than 64 bytes, and the RFCs say hostnames can be up to
255 characters in length. Ideally, we should someday change the way TORQUE
stores job info to make it less brittle.
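
To make the brittleness concrete, here is a minimal sketch of why growing a
fixed-size buffer changes the on-disk layout when a struct is written out as raw
bytes. This is not the real ji_qs definition or the real job_qs_upgrade code:
the struct names, the fields, EXTRA_ID_BYTES, and upgrade_job_rec() are all
hypothetical stand-ins. Only the 64 and 1024 values come from the change
discussed above.

```c
#include <stddef.h>
#include <string.h>

/* Only 64 and 1024 come from the thread; everything else is assumed. */
#define OLD_MAXHOSTNAME 64
#define NEW_MAXHOSTNAME 1024
#define EXTRA_ID_BYTES  32   /* placeholder for the other job id components */

/* Hypothetical record as written by a release built with the old limit. */
struct old_job_rec {
    int  state;
    char jobid[OLD_MAXHOSTNAME + EXTRA_ID_BYTES];
    int  priority;
};

/* The same record after the limit is raised: jobid grows, and every field
 * after it moves to a different byte offset in the file. */
struct new_job_rec {
    int  state;
    char jobid[NEW_MAXHOSTNAME + EXTRA_ID_BYTES];
    int  priority;
};

/* How many bytes larger the new record is than the old one. */
static size_t job_rec_growth(void)
{
    return sizeof(struct new_job_rec) - sizeof(struct old_job_rec);
}

/* Field-by-field conversion of an old record into the new layout. A raw
 * read of old bytes into the new struct would misplace 'priority', so
 * each field is copied explicitly. */
static void upgrade_job_rec(const struct old_job_rec *in,
                            struct new_job_rec *out)
{
    memset(out, 0, sizeof(*out));
    out->state = in->state;
    memcpy(out->jobid, in->jobid, sizeof(in->jobid));
    out->priority = in->priority;
}
```

A server built with the new limit would need a conversion step along these
lines for every existing job file, and a server built with the old limit has no
way to read the larger records, which is why a downgrade would require draining
the system first.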

> Also, we really should not be adding new features into 2.3-fixes
> (accounting_keep_days, log_keep_days, lock_file).

Again, I think there is a differing philosophy here and we need to have some
more discussion to decide what our guidelines will be.

Regards,

Josh Butikofer


