Bug 60 - Linux code to read /proc/self/exe breaks mom auto-update if directory symlinks are involved
: Linux code to read /proc/self/exe breaks mom auto-update if directory symlink...
Status: ASSIGNED
Product: TORQUE
pbs_mom
: 2.4.x
: Other Linux
: P5 enhancement
Assigned To: Michael Jennings
:
:
:
  Show dependency treegraph
 
Reported: 2010-05-14 01:53 MDT by Chris Samuel
Modified: 2013-02-11 17:41 MST (History)
5 users (show)

See Also:


Attachments
Proposed fix for the bug (22.90 KB, patch)
2013-01-20 22:56 MST, Michael Jennings
Details | Diff
Patch against 4.1.4 tarball (23.41 KB, patch)
2013-01-23 10:14 MST, Michael Jennings
Details | Diff
Patches libutil Makefile.in (1.18 KB, patch)
2013-02-05 22:35 MST, Craig West
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description Chris Samuel 2010-05-14 01:53:34 MDT
In src/resmom/mom_main.c there is a function MOMFindMyExe() which attempts to
locate the path to the executable of pbs_mom.  On Linux this returns the
contents of /proc/self/exe which breaks on systems which do versioning of
Torque installs.

For instance, at VPAC and here at VLSCI we install Torque into (for instance):

  /usr/local/torque/2.4.6

We then have a symlink from /usr/local/torque/latest to the current version in
use, and we start /usr/local/torque/latest/sbin/pbs_mom in our scripts.

What want is that path to be checked for a newer version, not the dereferenced
one which is in /proc/self/exe.  For a while I've been using a patch which just
#if 0's out the readlink() of /proc/self/exe to make it fall through to the
usual code which looks at argv[0] and that works fine.

Ideally this would be a runtime option specified in the pbs_mom config file,
defaulting to the current behaviour, but switchable when the admin realises
it's not working for the above reason. :-)
Comment 1 Glen 2010-11-18 20:59:44 MST
why keep the old behavior at all?  why not just check argv[0] all the time?
Comment 2 Garrick Staples 2010-11-18 22:44:05 MST
(In reply to comment #1)
> why keep the old behavior at all?  why not just check argv[0] all the time?

Because argv[0] might not be in PATH. 

To avoid another option in torque, just have your script do:
  V=$(readlink /blah/default)
  /blah/$V/bin/pbs_mom
Comment 3 Chris Samuel 2010-11-18 22:50:16 MST
I think we'd need to use readlink -f as readlink doesn't return anything
(because pbs_mom isn't a symlink, it's a directory higher up in the tree
which is).

But yes, that's an interesting idea..
Comment 4 Garrick Staples 2010-11-18 23:18:40 MST
(In reply to comment #3)
> I think we'd need to use readlink -f as readlink doesn't return anything
> (because pbs_mom isn't a symlink, it's a directory higher up in the tree
> which is).
> 
> But yes, that's an interesting idea..

Actually, isn't there an option to set the exec path?
Comment 5 David Beer 2010-11-19 09:14:48 MST
(In reply to comment #2)
> (In reply to comment #1)
> > why keep the old behavior at all?  why not just check argv[0] all the time?
> 
> Because argv[0] might not be in PATH. 
> 
> To avoid another option in torque, just have your script do:
>   V=$(readlink /blah/default)
>   /blah/$V/bin/pbs_mom

If you look at OriginalPath in the pbs_server code (main function) you may be
able to use that to fix the problem. I had to save Original Path to get the
TORQUE high availability feature working properly.
Comment 6 Craig West 2012-12-04 21:16:39 MST
This is still an issue in Torque 4.1.x 

Our local fix is still to do what is done in the first post.
Comment 7 Michael Jennings 2013-01-20 22:56:11 MST
Created an attachment (id=127) [details]
Proposed fix for the bug

Proposed fix.  I've generalized David's OriginalPath technique, cleaned up and
streamlined some of the code, and applied it to both pbs_server and pbs_mom.

Please let me know if this resolves the situation.  If you need help
backporting the fix to whatever version you're currently using, let me know.
Comment 8 Michael Jennings 2013-01-20 23:04:32 MST
FYI, the same fix backported to 4.1.x: 
https://github.com/kainx/torque/commit/578b0d8ed02777972e128816e2423cde4a35784d
Comment 9 Craig West 2013-01-22 21:03:14 MST
(In reply to comment #8)
> FYI, the same fix backported to 4.1.x: 
> https://github.com/kainx/torque/commit/578b0d8ed02777972e128816e2423cde4a35784d

Tried to apply this against released 4.1.4 and it failed:

patching file src/server/pbsd_main.c
Hunk #4 succeeded at 257 (offset 1 line).
Hunk #5 succeeded at 1651 (offset 20 lines).
Hunk #6 succeeded at 1674 (offset 20 lines).
Hunk #7 FAILED at 2936.
Hunk #8 FAILED at 3127.
Hunk #9 succeeded at 3190 (offset 15 lines).
2 out of 9 hunks FAILED -- saving rejects to file src/server/pbsd_main.c.rej
Comment 10 Michael Jennings 2013-01-23 10:14:40 MST
Created an attachment (id=128) [details]
Patch against 4.1.4 tarball

(In reply to comment #9)
> Tried to apply this against released 4.1.4 and it failed:
> 
> patching file src/server/pbsd_main.c
> Hunk #4 succeeded at 257 (offset 1 line).
> Hunk #5 succeeded at 1651 (offset 20 lines).
> Hunk #6 succeeded at 1674 (offset 20 lines).
> Hunk #7 FAILED at 2936.
> Hunk #8 FAILED at 3127.
> Hunk #9 succeeded at 3190 (offset 15 lines).
> 2 out of 9 hunks FAILED -- saving rejects to file src/server/pbsd_main.c.rej

Not surprising; large chunks of code had to be removed, and if patch sees any
differences at all, it will barf.

Here's a patch against the 4.1.4 tarball.
Comment 11 Michael Jennings 2013-01-25 18:03:29 MST
(In reply to comment #10)
> Here's a patch against the 4.1.4 tarball.

Chris?  Craig?  Any happiness?  :-)
Comment 12 Craig West 2013-02-05 22:35:36 MST
(In reply to comment #11)
> (In reply to comment #10)
> > Here's a patch against the 4.1.4 tarball.
> 
> Chris?  Craig?  Any happiness?  :-)

Michael,

I've been able to get it built. There is a problem in that the Makefile.am file
is patched, but the Makefile.in is not regenerated as part of the autogen.sh
script. The issue seems related to automake rather than Michael's patch. I have
a patch for the Makefile.in that allows it to build, which I will attach. I
don't consider my patch necessary for upstream.

Now I have it built it seems to be doing what it should, at least in my use
cases. It detects the symlinks being altered and restarts pbs_mom as required.
Touching the pbs_mom exec works as well.

Craig.
Comment 13 Craig West 2013-02-05 22:35:42 MST
Created an attachment (id=129) [details]
Patches libutil Makefile.in

Generated for testing only, shouldn't be needed in a release as the Makefile.am
should generate it properly.
Comment 14 Chris Samuel 2013-02-06 18:40:14 MST
(In reply to comment #11)

> (In reply to comment #10)
> > Here's a patch against the 4.1.4 tarball.
> 
> Chris?  Craig?  Any happiness?  :-)

Sorry, haven't been keeping up.

We're still on 2.4.x so not had a chance to look if it's feasible to backport
(and I suspect nobody would be interested).

cheers!
Chris
Comment 15 Michael Jennings 2013-02-11 17:41:21 MST
(In reply to comment #14)
> We're still on 2.4.x so not had a chance to look if it's feasible to backport
> (and I suspect nobody would be interested).

It shouldn't be too difficult.  The changes are fairly straight-forward.  The
patches won't apply cleanly, but it's pretty clear which blocks of code need to
be nuked.

Unfortunately I'm swamped right now, but if that changes I can attempt a
backport.