From knielson at adaptivecomputing.com Thu Sep 1 10:22:10 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 01 Sep 2011 10:22:10 -0600 (MDT)
Subject: [torquedev] Autoconf and TORQUE
In-Reply-To: <677cf157-4407-4288-b68a-937654063ce1@mail>
Message-ID: <63b736cf-a0bd-4a9d-a554-94f0bf7cde80@mail>
Hi all,
The system we use to build TORQUE uses autoconf version 2.59 and automake 1.9.2.
The versions of autoconf and automake do not matter to TORQUE 3.0.x, but they do make a difference with 2.4.x and 2.5.x because we deliver the Makefile.in files with the build.
My question is, does anyone depend on TORQUE building with autoconf 2.59 or automake 1.9.2? Can we upgrade to newer versions?
Regards
Ken
From bugzilla-daemon at supercluster.org Thu Sep 1 13:00:51 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Thu, 1 Sep 2011 13:00:51 -0600 (MDT)
Subject: [torquedev] [Bug 156] New: FIFO pbs_sched crash in check_nodes
Message-ID:
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=156
Summary: FIFO pbs_sched crash in check_nodes
Product: TORQUE
Version: 2.4.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P5
Component: pbs_sched
AssignedTo: jrosenquist at adaptivecomputing.com
ReportedBy: aripollak at gmail.com
CC: torquedev at supercluster.org
Estimated Hours: 0.0
pbs_sched coredumped this morning, and got this backtrace:
#0 pbs_rescquery (c=0, resclist=0xffd5fa08, num_resc=1, available=0xffd5fa18, allocated=0xffd5fa14, reserved=0xffd5fa10, down=0xffd5fa0c) at ../Libifl/pbsD_resc.c:215
#1 0x08054ff0 in check_nodes (pbs_sd=0, jinfo=0x8d0ac68, ninfo_arr=0x0) at check.c:507
#2 0x0805523c in is_ok_to_run_job (pbs_sd=0, sinfo=0x88089f0, qinfo=0x882e430, jinfo=0x8d0ac68) at check.c:185
#3 0x0804ce75 in scheduling_cycle (sd=0) at fifo.c:486
#4 0x0804d119 in schedule (cmd=-2754024, sd=0) at fifo.c:383
#5 0x0804bc24 in main (argc=1, argv=0xffd5fe74) at pbs_sched.c:1220
It appears that pbs_rescquery() is expecting arrays for available, allocated,
reserved, and down, but check_nodes() is sending it ints instead.
I'm using TORQUE 2.4.16.
--
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
From bugzilla-daemon at supercluster.org Thu Sep 1 14:11:53 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Thu, 1 Sep 2011 14:11:53 -0600 (MDT)
Subject: [torquedev] [Bug 156] FIFO pbs_sched crash in check_nodes
In-Reply-To:
References:
Message-ID: <20110901201153.5649A678263@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=156
--- Comment #1 from Ari Pollak 2011-09-01 14:11:53 MDT ---
Actually, passing int by reference doesn't really matter since num_resc=1.
Upon slightly further investigation, the problem is this:
(gdb) print reply->brp_un
$3 = {brp_jid = '\0' , brp_select = 0x0, brp_status =
{ll_prior = 0x0,
ll_next = 0x0, ll_struct = 0x0}, brp_statc = 0x0, brp_txt = {brp_txtlen =
0, brp_str = 0x0},
brp_locate = '\0' , brp_rescq = {brq_number = 0,
brq_avail = 0x0,
brq_alloc = 0x0, brq_resvd = 0x0, brq_down = 0x0}}
(gdb) print reply->brp_un.brp_rescq.brq_avail
$4 = (int *) 0x0
From mej at lbl.gov Thu Sep 1 16:30:30 2011
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 1 Sep 2011 15:30:30 -0700
Subject: [torquedev] Autoconf and TORQUE
In-Reply-To: <63b736cf-a0bd-4a9d-a554-94f0bf7cde80@mail>
References: <677cf157-4407-4288-b68a-937654063ce1@mail>
<63b736cf-a0bd-4a9d-a554-94f0bf7cde80@mail>
Message-ID: <20110901223028.GD15835@lbl.gov>
On Thursday, 01 September 2011, at 10:22:10 (-0600),
Ken Nielson wrote:
> The system we use to build TORQUE uses autoconf version 2.59 and
> automake 1.9.2.
>
> The version of autoconf and automake do not matter to TORQUE 3.0.x
> but it does make a difference with 2.4.x and 2.5.x because we
> deliver the Makefile.in files with the build.
>
> My question is, does anyone depend on TORQUE building with autoconf
> 2.59 or automake 1.9.2? Can we upgrade to newer versions?
Anyone can regenerate those files at any time by re-running autoconf,
automake, etc.
Unless you actually use macros which don't exist in the older
versions, it shouldn't matter at all. Just don't let the upgrading of
your local build tools tempt you to use the more modern (and thus not
backward compatible) macros. :)
Michael
--
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E W: 510-495-2687
MS 050C-3396 F: 510-486-8615
From samuel at unimelb.edu.au Thu Sep 1 19:46:15 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 02 Sep 2011 11:46:15 +1000
Subject: [torquedev] Environment variable for PBS script name ?
In-Reply-To: <72d7ebc0-6b9c-408a-bc3b-49f2e858ff3e@mail>
References: <72d7ebc0-6b9c-408a-bc3b-49f2e858ff3e@mail>
Message-ID: <4E603567.1090902@unimelb.edu.au>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 01/09/11 03:06, Ken Nielson wrote:
> Do they want the full path of the script from the
> submission host or the path where the script resides
> on the execution host?
On the submission host please.
For the interim I've suggested they pull the script
name out with qstat -f and then combine that with
$PBS_O_WORKDIR..
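A rough sketch of that interim workaround, in Python. It assumes the script name is the last token of the submit_args attribute printed by qstat -f and that a relative name is relative to $PBS_O_WORKDIR; both are assumptions, not documented behaviour, and the helper name script_path is made up for illustration. Values wrapped onto continuation lines are not handled.

```python
import posixpath
import shlex


def script_path(qstat_f_text, workdir):
    """Guess the submission-host path of a job script, or return None.

    Assumption: the script is the last token of the submit_args attribute.
    """
    for line in qstat_f_text.splitlines():
        name, sep, value = line.strip().partition(" = ")
        if sep and name == "submit_args":
            args = shlex.split(value)
            if args:
                # posixpath.join keeps an absolute script path unchanged
                return posixpath.join(workdir, args[-1])
    return None
```

Inside a job one would pass the text of qstat -f for the current job and the value of $PBS_O_WORKDIR.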
cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk5gNWcACgkQO2KABBYQAh8ULACghFNYDZfdrTgpBW6mxFhN1Ug+
XfUAn1rHnPOPw6I/nA9WtYqeVSD1/sXn
=93eF
-----END PGP SIGNATURE-----
From knielson at adaptivecomputing.com Fri Sep 2 06:34:28 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 02 Sep 2011 06:34:28 -0600 (MDT)
Subject: [torquedev] Environment variable for PBS script name ?
In-Reply-To: <4E603567.1090902@unimelb.edu.au>
Message-ID: <0b53ba31-9593-4e58-85e1-01af926e8b5e@mail>
----- Original Message -----
> From: "Christopher Samuel"
> To: torquedev at supercluster.org
> Sent: Thursday, September 1, 2011 7:46:15 PM
> Subject: Re: [torquedev] Environment variable for PBS script name ?
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/09/11 03:06, Ken Nielson wrote:
>
> > Do they want the full path of the script from the
> > submission host or the path where the script resides
> > on the execution host?
>
> On the submission host please.
>
> For the interim I've suggested they pull the script
> name out with qstat -f and then combine that with
> $PBS_O_WORKDIR..
>
> cheers,
> Chris
> - --
> Christopher Samuel - Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.unimelb.edu.au/
>
Chris,
Would you put a request for this in Bugzilla?
Regards
Ken
From knielson at adaptivecomputing.com Fri Sep 2 10:53:55 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 02 Sep 2011 10:53:55 -0600 (MDT)
Subject: [torquedev] TORQUE 2.5.8 released
In-Reply-To: <55d70021-d8fa-4625-b175-2618f4cadb39@mail>
Message-ID:
TORQUE 2.5.8 is now available for general release.
There were no new features added in this release but there were some notable bug fixes.
Among them was a fix for the queue resource procct. This resource is intended to be internal to TORQUE. The problem occurred when a job could not be routed to an execution queue and had to be put in a routing queue. The procct count would be interpreted by Moab and Maui as a generic resource. Since the procct generic resource did not exist, the job would be stuck. This has been fixed.
Another fix was for NVIDIA GPU mode setting. If exclusive_process or default was requested as a GPU mode and TORQUE was using a scheduler, the mode would not be changed. This was due to the scheduler stripping off the mode designation. This problem has now been fixed. However, using the default mode requires adding a property named "default" to the nodes file until a new release of Moab is available. The version of that Moab release is not known at this time.
Please see the CHANGELOG for all bugs fixed in this version of TORQUE.
The release can be downloaded at http://www.clusterresources.com/downloads/torque/torque-2.5.8.tar.gz
Thanks to everyone who has made this build possible.
Ken Nielson
Adaptive Computing
From bugzilla-daemon at supercluster.org Fri Sep 2 13:40:49 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:40:49 -0600 (MDT)
Subject: [torquedev] [Bug 157] New: Patches for OS X 10.5.8
Message-ID:
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157
Summary: Patches for OS X 10.5.8
Product: TORQUE
Version: 3.0.x
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P5
Component: pbs_server
AssignedTo: dbeer at adaptivecomputing.com
ReportedBy: nathanweeks at yahoo.com
CC: torquedev at supercluster.org
Estimated Hours: 0.0
TORQUE 3.0.2 needed a few source-code modifications to compile and run on OS X
10.5.8; patches are attached. A brief summary of the issues fixed:
u_threadpool.c:
* All versions of OS X (through 10.7) lack clock_gettime(); this patch works
around this by detecting in configure.ac whether clock_gettime() exists and, if
not, using gettimeofday() as suggested in the OS X 10.6/10.7 manual page for
pthread_cond_timedwait()
configure.ac:
* There is a compilation error when linking librt, as OS X doesn't have such a
library; the suggested workaround is to link with librt only if clock_gettime()
requires it (which it doesn't on some platforms, e.g., FreeBSD). This is in
addition to the previously mentioned check for the existence of
clock_gettime().
pbs_mkdirs.in:
* "echo" is a bash builtin, and when bash is invoked as "/bin/sh", the "-e"
option loses its special meaning, causing an "-e" to appear at the beginning of
the first line in the default server_priv/nodes.
--
Nathan Weeks
IT Specialist
USDA-ARS
http://www.public.iastate.edu/~weeks
From bugzilla-daemon at supercluster.org Fri Sep 2 13:41:36 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:41:36 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194136.82C4E4121400@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157
--- Comment #1 from Nathan Weeks 2011-09-02 13:41:36 MDT ---
Created an attachment (id=94)
--> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=94)
patch
From bugzilla-daemon at supercluster.org Fri Sep 2 13:42:09 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:42:09 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194209.9597B4121401@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157
Nathan Weeks changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #94|patch |pbs_mkdirs.in.patch
description| |
From bugzilla-daemon at supercluster.org Fri Sep 2 13:42:46 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:42:46 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194246.E8A7F4121403@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157
--- Comment #2 from Nathan Weeks 2011-09-02 13:42:46 MDT ---
Created an attachment (id=95)
--> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=95)
u_threadpool.c.patch
From bugzilla-daemon at supercluster.org Fri Sep 2 13:43:25 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:43:25 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194325.0E50F4121405@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157
--- Comment #3 from Nathan Weeks 2011-09-02 13:43:24 MDT ---
Created an attachment (id=96)
--> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=96)
configure.ac.patch
From bugzilla-daemon at supercluster.org Sun Sep 4 15:39:41 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Sun, 4 Sep 2011 15:39:41 -0600 (MDT)
Subject: [torquedev] [Bug 158] New: [2.5.8] suse.pbs_mom uses never touched
file /var/lock/subsys/pbs_mom
Message-ID:
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=158
Summary: [2.5.8] suse.pbs_mom uses never touched file
/var/lock/subsys/pbs_mom
Product: TORQUE
Version: 2.5.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P5
Component: pbs_mom
AssignedTo: knielson at adaptivecomputing.com
ReportedBy: burnus at net-b.de
CC: torquedev at supercluster.org
Estimated Hours: 0.0
On SUSE Linux, the directory /var/lock/subsys/ does not exist, the file is
never touched (contrary to contrib/init.d/pbs_mom) and I get the following
rpmlint warning:
torque-mom.x86_64: W: subsys-unsupported /etc/init.d/pbs_mom
The init script uses /var/lock/subsys which is not supported by this
distribution.
I think the following patch should be correct:
--- contrib/init.d/suse.pbs_mom.orig 2011-09-04 23:36:11.000000000 +0200
+++ contrib/init.d/suse.pbs_mom 2011-09-04 23:36:16.000000000 +0200
@@ -47,7 +47,7 @@
rc_status -v
;;
purge)
- [ -f /var/lock/subsys/pbs_mom ] && $0 stop
+ checkproc -p $PIDFILE $PBS_DAEMON && $0 stop
echo -n "Starting TORQUE Mom with purge: "
startproc $PBS_DAEMON -r $DAEMON_ARGS
rc_status -v
From bugzilla-daemon at supercluster.org Sun Sep 4 15:58:33 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Sun, 4 Sep 2011 15:58:33 -0600 (MDT)
Subject: [torquedev] [Bug 158] [2.5.8] suse.pbs_mom uses never touched file
/var/lock/subsys/pbs_mom
In-Reply-To:
References:
Message-ID: <20110904215833.CC07767885E@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=158
--- Comment #1 from Tobias Burnus 2011-09-04 15:58:33 MDT ---
While patching files, could you also apply the following patch, which silences
the warning:
torque.x86_64: W: incorrect-fsf-address
/usr/share/doc/packages/torque/contrib/pbstop
The Free Software Foundation address in this file seems to be outdated or
misspelled.
--- contrib/pbstop.orig 2011-09-04 23:52:58.000000000 +0200
+++ contrib/pbstop 2011-09-04 23:56:22.000000000 +0200
@@ -14,7 +14,7 @@
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
-# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
# Latest version of this software may be found at:
# http://www-rcf.usc.edu/~garrick/perl-PBS
From siegert at sfu.ca Wed Sep 7 17:44:31 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Wed, 7 Sep 2011 16:44:31 -0700
Subject: [torquedev] procs resource specification peculiarity
Message-ID: <20110907234431.GA6865@stikine.sfu.ca>
Hi,
I vaguely remember a discussion on this list about whether requests
of the form
-l nodes=x:ppn=y+procs=z
should be allowed and as far as I remember the common response was: no.
However, the code in qsub.c looks really strange to me:
case 'l':
l_opt = passet;
/* a ,procs= in the node spec is illegal. Validate the node spec */
if(strstr(optarg, ",procs="))
{
printf("illegal node spec: %s\n", optarg);
return(-1);
}
First of all the comment is strange: what has this to do with the
node spec? Instead, this has the bizarre effect that job submissions like
qsub -l walltime=24:00:00,procs=42 job.pbs
are rejected with an "illegal node spec" error message, whereas the
supposedly equivalent submission
qsub -l procs=42,walltime=24:00:00 job.pbs
is accepted. This looks like a bug to me, correct? At least it is
highly confusing for our users.
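To make the asymmetry concrete, here is a small Python model of the quoted check. It only mirrors the strstr(optarg, ",procs=") test shown above and is not the actual qsub code; the function name qsub_accepts is made up for illustration.

```python
def qsub_accepts(l_arg: str) -> bool:
    """Mirror the quoted check: reject any -l argument containing ',procs='."""
    return ",procs=" not in l_arg


if __name__ == "__main__":
    for arg in (
        "walltime=24:00:00,procs=42",
        "procs=42,walltime=24:00:00",
        "nodes=2:ppn=4+procs=8",
    ):
        verdict = "accepted" if qsub_accepts(arg) else "illegal node spec"
        print(f"-l {arg}: {verdict}")
```

Only the first argument is rejected, although it requests the same resources as the second; the third, which puts procs inside the node spec, passes the check.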
It is my understanding that a "," in a -l argument should be totally
equivalent to another -l argument, i.e.,
-l spec1,spec2,spec3
should be equivalent to
-l spec1 -l spec2 -l spec3
correct? If my recollection is correct and we agreed that
-l nodes=x:ppn=y+procs=z should be rejected and that
-l nodes=x:ppn=y -l procs=z should be used instead, then
-l nodes=x:ppn=y,procs=z should be accepted as well.
Instead, the code in qsub.c appears to accept
-l nodes=x:ppn=y+procs=z while it rejects any spec that contains
",procs=". Just the opposite of what I thought we had discussed.
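The behaviour I had in mind could be sketched like this in Python. This is only an illustration of the rule as I remember it, not a proposed patch: the names split_resources and validate are hypothetical, and it assumes resource values never contain a comma themselves.

```python
def split_resources(l_args):
    """Flatten several -l arguments into one list of name=value items."""
    items = []
    for arg in l_args:
        items.extend(part for part in arg.split(",") if part)
    return items


def validate(l_args):
    """Return the resource list, or raise ValueError for procs inside a node spec."""
    items = split_resources(l_args)
    for item in items:
        name, _, value = item.partition("=")
        # Only a procs= chunk joined to the node spec with '+' is rejected.
        if name == "nodes" and any(
            chunk.startswith("procs=") for chunk in value.split("+")
        ):
            raise ValueError(f"illegal node spec: {item}")
    return items
```

With this, validate(["nodes=2:ppn=4,procs=8"]) and validate(["nodes=2:ppn=4", "procs=8"]) give the same list, while validate(["nodes=2:ppn=4+procs=8"]) raises ValueError.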
Cheers,
Martin
--
Martin Siegert
Simon Fraser University
From cwest at vpac.org Wed Sep 7 21:30:57 2011
From: cwest at vpac.org (Craig West)
Date: Thu, 08 Sep 2011 13:30:57 +1000
Subject: [torquedev] TORQUE 2.5.8 released
In-Reply-To:
References:
Message-ID: <4E6836F1.4020508@vpac.org>
Hi Ken,
> Please see the CHANGELOG for all bugs fixed in this version of TORQUE.
>
> The release can be downloaded at http://www.clusterresources.com/downloads/torque/torque-2.5.8.tar.gz
Seems we still need to download the entire package just to read the
CHANGELOG. The change logs were being put in the location below for a
while, can we please continue to do that?
http://www.clusterresources.com/downloads/torque/CHANGELOGS/
Thanks,
Craig.
--
Craig West Systems Manager
Victorian Partnership for Advanced Computing
110 Victoria Street, Carlton South VIC 3053
P: +61 3 9925 4751 E: cwest at vpac.org
http://www.vpac.org
From knielson at adaptivecomputing.com Thu Sep 8 07:53:41 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 08 Sep 2011 07:53:41 -0600 (MDT)
Subject: [torquedev] TORQUE 2.5.8 released
In-Reply-To: <4E6836F1.4020508@vpac.org>
Message-ID:
----- Original Message -----
> From: "Craig West"
> To: torquedev at supercluster.org
> Sent: Wednesday, September 7, 2011 9:30:57 PM
> Subject: Re: [torquedev] TORQUE 2.5.8 released
>
>
> Hi Ken,
>
> > Please see the CHANGELOG for all bugs fixed in this version of
> > TORQUE.
> >
> > The release can be downloaded at
> > http://www.clusterresources.com/downloads/torque/torque-2.5.8.tar.gz
>
> Seems we still need to download the entire package just to read the
> CHANGELOG. The change logs were being put in the location below for a
> while, can we please continue to do that?
>
> http://www.clusterresources.com/downloads/torque/CHANGELOGS/
>
> Thanks,
> Craig.
>
> --
> Craig West Systems Manager
> Victorian Partnership for Advanced Computing
> 110 Victoria Street, Carlton South VIC 3053
> P: +61 3 9925 4751 E: cwest at vpac.org
> http://www.vpac.org
Craig,
Thanks for pointing that out.
Sorry. It will be up shortly.
Ken
From knielson at adaptivecomputing.com Fri Sep 9 16:54:05 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 09 Sep 2011 16:54:05 -0600 (MDT)
Subject: [torquedev] auto tool problem
In-Reply-To:
Message-ID: <9de50321-101c-403c-a96d-785875862567@mail>
Hi all,
We are having a problem with configure.ac and autoconf for the trunk when building on CentOS 5 and CentOS 3.
Autoconf produces the following error.
configure.ac:49: error: possibly undefined macro: AC_MSG_ERROR
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.
This compiles fine on Ubuntu versions 10 and 11. We have used autoconf version 2.65 and 2.67. The libtool version is 2.4 and automake is version 1.11.
On CentOS we tried the above versions as well as autoconf version 2.59.
I also tried this in CentOS 6. It builds successfully.
Thanks for the help.
You can use subversion to check out the trunk at svn://clusterresources.com/torque/trunk.
Regards
Ken Nielson
Adaptive Computing
From knielson at adaptivecomputing.com Thu Sep 15 12:19:40 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 15 Sep 2011 12:19:40 -0600 (MDT)
Subject: [torquedev] auto tool problem
In-Reply-To: <9de50321-101c-403c-a96d-785875862567@mail>
Message-ID:
----- Original Message -----
> From: "Ken Nielson"
> To: "Torque Developers mailing list"
> Sent: Friday, September 9, 2011 4:54:05 PM
> Subject: [torquedev] auto tool problem
>
> Hi all,
>
> We are having a problem with configure.ac and autoconf for the trunk
> when building on CentOS 5 and CentOS 3.
>
> Autoconf produces the following error.
>
> configure.ac:49: error: possibly undefined macro: AC_MSG_ERROR
> If this token and others are legitimate, please use
> m4_pattern_allow.
> See the Autoconf documentation.
>
> This compiles fine on Ubuntu versions 10 and 11. We have used
> autoconf version 2.65 and 2.67. The libtool version is 2.4 and
> automake is version 1.11.
>
>
Hi all,
It turns out that this is not an OS problem at all. The problem was that hwloc was not installed on the machines where we were building TORQUE.
So, in addition to the many causes I found on the web for the autoconf "error: possibly undefined macro" message, another possible cause is that a library may be missing.
Regards
Ken Nielson
From mej at lbl.gov Thu Sep 15 12:47:47 2011
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 15 Sep 2011 11:47:47 -0700
Subject: [torquedev] auto tool problem
In-Reply-To:
References: <9de50321-101c-403c-a96d-785875862567@mail>
Message-ID: <20110915184744.GZ10643@lbl.gov>
On Thursday, 15 September 2011, at 12:19:40 (-0600),
Ken Nielson wrote:
> It turns out that this is not an OS problem at all. The problem was
> the hwloc was not installed on the machines where we were building
> TORQUE.
Wow. I might've given up before figuring that one out. :) I've been
trying to reproduce the problem here without much success. Glad you
got it resolved!
Michael
--
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E W: 510-495-2687
MS 050C-3396 F: 510-486-8615
From pablo.fernandez at cscs.ch Fri Sep 16 09:35:38 2011
From: pablo.fernandez at cscs.ch (Pablo Fernandez)
Date: Fri, 16 Sep 2011 17:35:38 +0200
Subject: [torquedev] possible bug in qstat -f
Message-ID: <201109161735.38162.pablo.fernandez@cscs.ch>
Dear devels,
We have been fighting with a third-party script that depends on the output of
qstat -f... and it seems the output format has changed slightly, and the
script does not work anymore. We have changed the script to make it work, but
as I mentioned, it is third-party, so no further updates are possible.
Besides, it really seems to me like a typo somewhere in the pbs client code
(2.4.16), so I thought it may be worth sharing it here.
So, if you type qstat -f and you show all special characters, you get this:
Job Id: 5304000.lrms02.lcg.cscs.ch$
Job_Name = cre01_914471456$
Job_Owner = lhcbprd01 at cream01.lcg.cscs.ch$
job_state = Q$
queue = lhcb$
server = lrms02.lcg.cscs.ch$
Checkpoint = u$
ctime = Sat Sep 10 00:10:41 2011$
Error_Path = cream01.lcg.cscs.ch:/dev/null$
Output_Path = cream01.lcg.cscs.ch:/dev/null$
Priority = 0$
[blabla]
euser = lhcbprd01$
egroup = lhcb$
hashname = 5304000.lrms02.lcg.cscs.ch$
queue_rank = 1300167$
queue_type = E$
comment = job rejected by RM 'lrms02' - job started on hostlist wn176.lcg.$
^Icscs.ch at time 15:20:19_09/16,$
^I job reported idle at time 15:23:14_09/16 (see RM logs for details)$
$
etime = Fri Sep 16 15:50:32 2011$
submit_args = /tmp/cre01_914471456$
fault_tolerant = False$
$
Job Id: 5304708.lrms02.lcg.cscs.ch$
(the next job continues...)
So, if you look closely, the "comment =" field ends with two newlines,
whereas all the other fields end with one. This means that any parser that
expects a blank line to separate two jobs will fail.
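For what it is worth, a parser can avoid depending on the blank line altogether by starting a new record at each "Job Id:" line. A minimal Python sketch follows; it assumes the layout shown above (attributes as "name = value", wrapped values continued on lines that start with a tab), and the function name parse_qstat_f is made up for illustration.

```python
def parse_qstat_f(text):
    """Parse `qstat -f` output into {job_id: {attribute: value}}."""
    jobs = {}
    attrs = None   # attributes of the job currently being read
    last = None    # name of the attribute a continuation line extends
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            attrs = jobs.setdefault(line.split(":", 1)[1].strip(), {})
            last = None
        elif attrs is None or not line.strip():
            continue   # blank lines are never treated as separators
        elif line.startswith("\t") and last is not None:
            attrs[last] += line[1:]   # wrapped value: drop only the tab
        elif " = " in line:
            name, value = line.strip().split(" = ", 1)
            attrs[name] = value
            last = name
    return jobs
```

Because a blank line is simply skipped, the extra newline after the comment no longer splits the job in two, and the attributes that follow it (etime, submit_args, ...) stay with the right job.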
I guess the comment line was introduced in the 2.4 series, because the scripts
were designed for 2.3, and they don't fail there.
Is this actually a bug that could be fixed within the same series (2.4)?
Thanks a lot!
BR/Pablo
From samuel at unimelb.edu.au Sun Sep 18 23:18:00 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 19 Sep 2011 15:18:00 +1000
Subject: [torquedev] possible bug in qstat -f
In-Reply-To: <201109161735.38162.pablo.fernandez@cscs.ch>
References: <201109161735.38162.pablo.fernandez@cscs.ch>
Message-ID: <4E76D088.3070708@unimelb.edu.au>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 17/09/11 01:35, Pablo Fernandez wrote:
> So, if you take a look closely, the "comment=" line finishes with two
> returns, where all the other fields finish with one. So, this means any
> parser that expects a blank line to separate two jobs, will fail.
I don't see that here (2.4.16), looking at one of our
jobs with a comment set by Moab 6.1.1 in a hex editor
only shows one new line character (0A).
Could it be that your Moab is setting a comment with
an extra new line at the end ?
cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk520IgACgkQO2KABBYQAh9iOgCgj2CoqN6xZYYihjBVHp3nTI2W
MRIAn0CrE8bhv5DTU24GdwR1WeflyW57
=Xpuu
-----END PGP SIGNATURE-----
From pablo.fernandez at cscs.ch Mon Sep 19 03:19:57 2011
From: pablo.fernandez at cscs.ch (Pablo Fernandez)
Date: Mon, 19 Sep 2011 11:19:57 +0200
Subject: [torquedev] possible bug in qstat -f
In-Reply-To: <4E76D088.3070708@unimelb.edu.au>
References: <201109161735.38162.pablo.fernandez@cscs.ch>
<4E76D088.3070708@unimelb.edu.au>
Message-ID: <201109191119.57691.pablo.fernandez@cscs.ch>
Hi,
> > So, if you take a look closely, the "comment=" line finishes with two
> > returns, where all the other fields finish with one. So, this means any
> > parser that expects a blank line to separate two jobs, will fail.
>
> I don't see that here (2.4.16), looking at one of our
> jobs with a comment set by Moab 6.1.1 in a hex editor
> only shows one new line character (0A).
>
> Could it be that your Moab is setting a comment with
> an extra new line at the end ?
I guess it could be... I have just upgraded to Moab 6.1.2, and the comments
are still there, with the extra CR, but maybe they are old comments that
torque keeps. I will have to wait for new comments, I guess.
Thanks!
Pablo
From jayavant.patil82 at gmail.com Mon Sep 19 05:20:28 2011
From: jayavant.patil82 at gmail.com (Jayavant Patil)
Date: Mon, 19 Sep 2011 16:50:28 +0530
Subject: [torquedev] Automatic Job REQUEUE
Message-ID:
Hi,
I am using TORQUE 3.0.0 and Maui 3.3. When the compute node on which
a job is running fails because of a crash, power failure, or any other reason,
how can the job that was running on that node be *automatically*
requeued? (I am aware that we can rerun the job manually with *qrerun*, but
I don't want to do that.)
Thanks in advance.
--
Thanks & Regards,
Jayavant N. Patil
From s.prabhakaran at grs-sim.de Tue Sep 27 06:14:20 2011
From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran)
Date: Tue, 27 Sep 2011 14:14:20 +0200
Subject: [torquedev] Documentation?
In-Reply-To: <201109191119.57691.pablo.fernandez@cscs.ch>
References: <201109161735.38162.pablo.fernandez@cscs.ch>
<4E76D088.3070708@unimelb.edu.au>
<201109191119.57691.pablo.fernandez@cscs.ch>
Message-ID: <2DDA3FD4-AE99-449C-AA94-6BD5163301F9@grs-sim.de>
Hello all,
Is there any design documentation for PBS/Torque (other than the admin guide) available somewhere? If so, could anyone please point me to it?
I have been able to find the old PBS (2.2) external reference specification, internal design specification, and requirements specification. Haven't been able to find the external design specification. Is there any documentation available that delves into design?
Thank you,
Suraj
From l.flis at cyf-kr.edu.pl Tue Sep 27 17:15:34 2011
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Wed, 28 Sep 2011 01:15:34 +0200
Subject: [torquedev] Torque 2.5.8 - memory leaks
Message-ID: <4E825916.2000802@cyf-kr.edu.pl>
Hi,
We are running a medium-sized cluster with Torque and Moab.
Average number of jobs is usually around 4k
Number of nodes: 950
Number of cores (at the moment) 10k
We recently migrated from 2.4.12 to 2.5.8. Unfortunately, we are
observing torque memory issues. I'm attaching a daily graph from our
ganglia monitoring system. On average it takes 3 hours for torque to
consume 8 GB of memory. After that, the pbs_server daemon has to be
restarted because it can no longer fork for lack of available memory,
and then the OOM killer kicks in.
pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed
Due to the scale of the system, I am unable to run pbs_server under
valgrind to find the source of the leak. I did some testing on our test
cluster, but I'm not sure how accurate valgrind's results are.
A lot of the messages point to the decode_str function:
==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
==31895== at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895== by 0x452BD8: decode_str (attr_fn_str.c:144)
==31895== by 0x40A768: recov_attr (attr_recov.c:512)
==31895== by 0x44909C: svr_recov (svr_recov.c:204)
==31895== by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895== by 0x4213A2: main (pbsd_main.c:1465)
as well as to decode_arst_direct:
==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
==31895== at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895== by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
==31895== by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
==31895== by 0x40A768: recov_attr (attr_recov.c:512)
==31895== by 0x44909C: svr_recov (svr_recov.c:204)
==31895== by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895== by 0x4213A2: main (pbsd_main.c:1465)
I am going to dig into the sources a bit and see whether the memory
allocated by the above functions is freed properly.
However, any suggestions and hints are welcome, as I might be unable
to fix it all myself.
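To make the thing being checked concrete: a "definitely lost" block is one
that was malloc'd and then left with no pointer to it. A common way this
happens in decode-style functions is overwriting a previously stored pointer
without freeing it first. The sketch below shows that general pattern and its
fix. It is an illustration only: struct str_attr and the three functions are
hypothetical, not TORQUE's actual data structures or code, and whether
decode_str or decode_arst_direct leak this way is not established here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string attribute; not TORQUE's real struct. */
struct str_attr {
    char *value;   /* heap copy owned by the attribute, or NULL */
};

/* Leaky variant: stores a new copy without releasing the old one, so each
 * repeated decode leaves the previous malloc'd block unreachable. */
int decode_str_leaky(struct str_attr *attr, const char *val)
{
    assert(attr != NULL && val != NULL);
    char *copy = malloc(strlen(val) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, val);
    attr->value = copy;      /* the old attr->value, if any, is lost here */
    return 0;
}

/* Fixed variant: release the previous value before storing the new one. */
int decode_str_fixed(struct str_attr *attr, const char *val)
{
    assert(attr != NULL && val != NULL);
    char *copy = malloc(strlen(val) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, val);
    free(attr->value);       /* free(NULL) is a no-op */
    attr->value = copy;
    return 0;
}

/* Matching cleanup, to be called when the attribute is discarded. */
void free_str_attr(struct str_attr *attr)
{
    assert(attr != NULL);
    free(attr->value);
    attr->value = NULL;
}
```

Decoding twice into the same attribute with the leaky variant loses the first
block, while the fixed variant followed by free_str_attr leaves nothing
allocated. The thing to look for in the real sources is the equivalent of the
free call on every path that replaces or discards a decoded value.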
Thank you for your attention
--
Lukasz Flis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: leaking-torque.png
Type: image/png
Size: 58057 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20110928/1a3bb9a0/attachment-0001.png
From Gareth.Williams at csiro.au Wed Sep 28 16:00:13 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Thu, 29 Sep 2011 08:00:13 +1000
Subject: [torquedev] Documentation?
In-Reply-To: <2DDA3FD4-AE99-449C-AA94-6BD5163301F9@grs-sim.de>
References: <201109161735.38162.pablo.fernandez@cscs.ch>
<4E76D088.3070708@unimelb.edu.au>
<201109191119.57691.pablo.fernandez@cscs.ch>
<2DDA3FD4-AE99-449C-AA94-6BD5163301F9@grs-sim.de>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE4F@exvic-mbx04.nexus.csiro.au>
> -----Original Message-----
> From: torquedev-bounces at supercluster.org
> [mailto:torquedev-bounces at supercluster.org] On Behalf Of Suraj Prabhakaran
> Sent: Tuesday, 27 September 2011 10:14 PM
> To: Torque Developers mailing list
> Subject: [torquedev] Documentation?
>
> Hello all,
>
> Is there any design documentation for PBS/Torque (other than the admin
> guide) available somewhere? If so, could anyone please point me to
> it?
> I have been able to find the old PBS (2.2) external reference
> specification, internal design specification, and requirements
> specification, but not the external design specification. Is there
> any documentation available that delves into the design?
>
> Thank you,
> Suraj
Hi Suraj,
These posts have (broken) links to diagrams of job states. I recall they were good though they may not be what you want.
http://www.supercluster.org/pipermail/torqueusers/2008-January/006706.html
http://www.supercluster.org/pipermail/torqueusers/2007-June/005724.html
Gareth
From knielson at adaptivecomputing.com Wed Sep 28 16:00:32 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 28 Sep 2011 16:00:32 -0600 (MDT)
Subject: [torquedev] Torque 2.5.8 - memory leaks
In-Reply-To: <4E825916.2000802@cyf-kr.edu.pl>
Message-ID: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
----- Original Message -----
> From: "Lukasz Flis"
> To: "Torque Developers mailing list"
> Sent: Tuesday, September 27, 2011 5:15:34 PM
> Subject: [torquedev] Torque 2.5.8 - memory leaks
What scheduler are you using?
Ken Nielson
From toth at fi.muni.cz Wed Sep 28 16:27:46 2011
From: toth at fi.muni.cz ("Mgr. Šimon Tóth")
Date: Thu, 29 Sep 2011 00:27:46 +0200
Subject: [torquedev] Torque 2.5.8 - memory leaks
In-Reply-To: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
References: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
Message-ID: <4E839F62.1060901@fi.muni.cz>
> What scheduler are you using?
How is that relevant?
--
Mgr. Simon Toth
From l.flis at cyf-kr.edu.pl Wed Sep 28 16:29:15 2011
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Thu, 29 Sep 2011 00:29:15 +0200
Subject: [torquedev] Torque 2.5.8 - memory leaks
In-Reply-To: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
References: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
Message-ID: <201109290029.15266.l.flis@cyf-kr.edu.pl>
Hello Ken,
On Thursday 29 September 2011 00:00:32 Ken Nielson wrote:
> What scheduler are you using?
We are using Moab 6.1.1.
The server's scheduling attribute is set to True.
We had similar problems with 2.4.12 and Maui, but the amount of
consumed/leaked memory was smaller because that Torque was a 32-bit build.