[torqueusers] Solution: torque-1.2.0p6 build problem using gcc 3.4 (RHEL4 or FC4)

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Mon Sep 19 09:09:26 MDT 2005

Dear Torque users,

We have previously discussed a problem starting LAM-MPI parallel jobs
with torque-1.2.0p6 in an earlier thread on this list.

If you use Torque on Redhat Enterprise Linux 4, Fedora Core 4 or any
other system using gcc 3.4 (or later), you should know about a
problem caused by a new feature in gcc 3.4, as well as the solution
to this problem:

We found that the Torque build process has a problem with
gcc 3.4.3, namely that a "make install" will cause a second,
superfluous recompilation of everything.  If you're building
an RPM, this causes subtle problems in the resulting RPMs
because some hardcoded paths may be incorrect.  This was the
problem that made LAM-MPI booting fail because pbs_mom
could not find the pbs_demux executable (see the above thread).

The quick summary:

1. With Torque up to and including 1.2.0p6, a workaround is to
    configure Torque with an additional CFLAGS option
    -fno-working-directory, if your system uses gcc 3.4 or newer.
2. Torque 1.2.0p7 (current snapshot and later) has a patch in
    buildutils/makedepend-sh which is the permanent solution,
    so the -fno-working-directory workaround is not needed here.
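
As a rough sketch of workaround 1, the helper below appends
-fno-working-directory to CFLAGS only when the compiler version is 3.4
or newer.  The function names gcc_needs_workaround and torque_cflags are
made up for this illustration and are not part of Torque, and the
version string is assumed to look like the output of "gcc -dumpversion".

```shell
# Hypothetical helper: succeeds (exit status 0) if the version string
# given as $1 (e.g. "3.4.3") denotes gcc 3.4 or newer.
gcc_needs_workaround() {
    major=${1%%.*}          # text before the first dot
    rest=${1#*.}            # text after the first dot
    minor=${rest%%.*}       # text before the next dot
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 4 ]; }
}

# Hypothetical helper: print the CFLAGS to use, given the base flags ($1)
# and the gcc version string ($2).
torque_cflags() {
    if gcc_needs_workaround "$2"; then
        echo "$1 -fno-working-directory"
    else
        echo "$1"
    fi
}

# Possible use when configuring Torque 1.2.0p6 or older:
#   CFLAGS="$(torque_cflags "-g -O2" "$(gcc -dumpversion)")" ./configure
```

The base flags "-g -O2" in the usage comment are only an example; use
whatever CFLAGS your site normally builds with.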

Additional details:

The gcc 3.4 man-page describes a new feature, -fworking-directory:
            Enable generation of linemarkers in the preprocessor output that
            will let the compiler know the current working directory at the
            time of preprocessing.  When this option is enabled, the
            preprocessor will emit, after the initial linemarker, a second
            linemarker with the current working directory followed by two
            slashes.

This feature is enabled implicitly whenever debugging information is
enabled.  So if you use the -g flag in CFLAGS (the default), Torque's
buildutils/makedepend-sh script adds to the Makefile a dependency of
every .o file upon the timestamp of the current working directory.
Look for the following pattern in the Makefile:

# DO NOT DELETE THIS LINE -- makedepend-sh depends on it
accounting.o: ./accounting.c
accounting.o: /scratch/Torque/torque-1.2.0p6/src/server//

The line terminated with "//" refers to the current working directory.
This dependency causes all .o files to be rebuilt every time you
do a "make" in any directory, including the case where you do a
"make install".
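
One way to check a generated Makefile for this pattern is sketched
below.  The function name find_cwd_deps is invented for this example,
and the regular expression assumes the dependency lines look like the
ones shown above.

```shell
# Hypothetical helper: print, with line numbers, every .o dependency
# line in the Makefile given as $1 whose prerequisite ends in "//".
# Exit status is non-zero if no such line is found.
find_cwd_deps() {
    grep -n '^[^#:]*\.o: .*//$' "$1"
}

# Possible use in a Torque source directory:
#   find_cwd_deps src/server/Makefile
```

If this prints nothing, the Makefile does not contain the bogus
working-directory dependency.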

In the case of RPM building, this is a real problem because all files
will be installed into a temporary location.  The pbs_mom will
now have an incorrect hardcoded path to pbs_demux and pbs_rcp,
for example, /var/tmp/torque-1.2.0p6-buildroot/usr/sbin/pbs_demux
(check this by "strings /usr/sbin/pbs_mom | grep pbs_demux").
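
The check can be wrapped in a small filter, sketched here.  The name
find_buildroot_paths is hypothetical, and the filter assumes that the
RPM build root directory name contains "-buildroot", as in the example
path above; adjust the pattern if your build root is named differently.

```shell
# Hypothetical helper: read strings on stdin and print those that name
# pbs_demux or pbs_rcp underneath a directory containing "-buildroot".
find_buildroot_paths() {
    grep -E '(pbs_demux|pbs_rcp)' | grep -e '-buildroot/'
}

# Possible use on an installed system:
#   strings /usr/sbin/pbs_mom | find_buildroot_paths
```

Any output indicates that pbs_mom was compiled with a path into the
temporary build location.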

In this scenario all parallel jobs using the "tm" boot interface
will fail because pbs_mom cannot start the pbs_demux process.
A simple test is to run "pbsdsh hostname" within a multi-node
PBS batch job.  If pbsdsh gives error messages, you may have the
above problem, and other environments using the "tm" interface,
such as LAM-MPI, are going to fail as well.

If you want to patch your current Torque installation, here's
the diff (now in the CVS for 1.2.0p7) as provided by Garrick:

--- buildutils/makedepend-sh_orig       2005-09-18 10:04:34.000000000 -0700
+++ buildutils/makedepend-sh    2005-09-18 10:04:05.000000000 -0700
@@ -575,6 +575,7 @@

                  eval $CPP $arg_cc $d/$s $errout | \
                    sed -n -e "s;^\# [0-9][0-9 ]*\"\(.*\)\";$f: \1;p" | \
+                  grep -v "$PWD//\$" | \
                    grep -v "$s\$" | grep -v command | grep -v built-in | \
                    sed -e 's;\([^ :]*: [^ ]*\).*;\1;' \
                    >> $TMP
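
To illustrate what the added grep does, here is a simplified stand-in
for that part of the pipeline.  It is not the real makedepend-sh code:
the function name filter_deps is invented, the object name and working
directory are passed as arguments instead of $f and $PWD, and the
preprocessor output is read from stdin rather than produced by $CPP.

```shell
# Simplified sketch: turn preprocessor linemarkers read from stdin into
# Makefile dependency lines for object file $1, dropping the linemarker
# that names the working directory $2 followed by "//".
filter_deps() {
    obj=$1
    cwd=$2
    sed -n -e "s;^# [0-9][0-9 ]*\"\(.*\)\";$obj: \1;p" |
        grep -v "$cwd//\$" |
        grep -v command | grep -v built-in |
        sed -e 's;\([^ :]*: [^ ]*\).*;\1;'
}

# Possible use, fed with made-up linemarkers:
#   printf '%s\n' '# 1 "accounting.c"' '# 1 "/some/dir//"' |
#       filter_deps accounting.o /some/dir
```

Without the grep on "$cwd//", the working directory line would survive
as a dependency and trigger the rebuild described above.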

Many thanks go to Garrick Staples (USC) for much ping-pong debugging
and for coming up with the patch as well as the -fno-working-directory
workaround.

Ole Holm Nielsen
Department of Physics, Technical University of Denmark
