From knielson at adaptivecomputing.com Thu Sep 1 10:22:10 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 01 Sep 2011 10:22:10 -0600 (MDT)
Subject: [torquedev] Autoconf and TORQUE
In-Reply-To: <677cf157-4407-4288-b68a-937654063ce1@mail>
Message-ID: <63b736cf-a0bd-4a9d-a554-94f0bf7cde80@mail>

Hi all,

The system we use to build TORQUE uses autoconf version 2.59 and
automake 1.9.2.

The version of autoconf and automake do not matter to TORQUE 3.0.x
but it does make a difference with 2.4.x and 2.5.x because we
deliver the Makefile.in files with the build.

My question is, does anyone depend on TORQUE building with autoconf
2.59 or automake 1.9.2? Can we upgrade to newer versions?

Regards

Ken

From bugzilla-daemon at supercluster.org Thu Sep 1 13:00:51 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Thu, 1 Sep 2011 13:00:51 -0600 (MDT)
Subject: [torquedev] [Bug 156] New: FIFO pbs_sched crash in check_nodes
Message-ID:

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=156

           Summary: FIFO pbs_sched crash in check_nodes
           Product: TORQUE
           Version: 2.4.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_sched
        AssignedTo: jrosenquist at adaptivecomputing.com
        ReportedBy: aripollak at gmail.com
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

pbs_sched coredumped this morning, and got this backtrace:

#0  pbs_rescquery (c=0, resclist=0xffd5fa08, num_resc=1,
    available=0xffd5fa18, allocated=0xffd5fa14, reserved=0xffd5fa10,
    down=0xffd5fa0c) at ../Libifl/pbsD_resc.c:215
#1  0x08054ff0 in check_nodes (pbs_sd=0, jinfo=0x8d0ac68, ninfo_arr=0x0)
    at check.c:507
#2  0x0805523c in is_ok_to_run_job (pbs_sd=0, sinfo=0x88089f0,
    qinfo=0x882e430, jinfo=0x8d0ac68) at check.c:185
#3  0x0804ce75 in scheduling_cycle (sd=0) at fifo.c:486
#4  0x0804d119 in schedule (cmd=-2754024, sd=0) at fifo.c:383
#5  0x0804bc24 in main (argc=1, argv=0xffd5fe74) at pbs_sched.c:1220

It appears that pbs_rescquery()
is expecting arrays for available, allocated, reserved, and down, but
check_nodes() is sending it ints instead.

I'm using TORQUE 2.4.16.

-- 
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

From bugzilla-daemon at supercluster.org Thu Sep 1 14:11:53 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Thu, 1 Sep 2011 14:11:53 -0600 (MDT)
Subject: [torquedev] [Bug 156] FIFO pbs_sched crash in check_nodes
In-Reply-To:
References:
Message-ID: <20110901201153.5649A678263@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=156

--- Comment #1 from Ari Pollak 2011-09-01 14:11:53 MDT ---
Actually, passing int by reference doesn't really matter since
num_resc=1. Upon slightly further investigation, the problem is this:

(gdb) print reply->brp_un
$3 = {brp_jid = '\0' , brp_select = 0x0, brp_status = {ll_prior = 0x0,
    ll_next = 0x0, ll_struct = 0x0}, brp_statc = 0x0,
    brp_txt = {brp_txtlen = 0, brp_str = 0x0}, brp_locate = '\0' ,
    brp_rescq = {brq_number = 0, brq_avail = 0x0, brq_alloc = 0x0,
    brq_resvd = 0x0, brq_down = 0x0}}
(gdb) print reply->brp_un.brp_rescq.brq_avail
$4 = (int *) 0x0

From mej at lbl.gov Thu Sep 1 16:30:30 2011
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 1 Sep 2011 15:30:30 -0700
Subject: [torquedev] Autoconf and TORQUE
In-Reply-To: <63b736cf-a0bd-4a9d-a554-94f0bf7cde80@mail>
References: <677cf157-4407-4288-b68a-937654063ce1@mail>
 <63b736cf-a0bd-4a9d-a554-94f0bf7cde80@mail>
Message-ID: <20110901223028.GD15835@lbl.gov>

On Thursday, 01 September 2011, at 10:22:10 (-0600),
Ken Nielson wrote:

> The system we use to build TORQUE uses autoconf version 2.59 and
> automake 1.9.2.
>
> The version of autoconf and automake do not matter to TORQUE 3.0.x
> but it does make a difference with 2.4.x and 2.5.x because we
> deliver the Makefile.in files with the build.
>
> My question is, does anyone depend on TORQUE building with autoconf
> 2.59 or automake 1.9.2? Can we upgrade to newer versions?

Anyone can regenerate those files at any time by re-running autoconf,
automake, etc. Unless you actually use macros which don't exist in
the older versions, it shouldn't matter at all.

Just don't let the upgrading of your local build tools tempt you to
use the more modern (and thus not backward compatible) macros. :)

Michael

-- 
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E        W: 510-495-2687
MS 050C-3396          F: 510-486-8615

From samuel at unimelb.edu.au Thu Sep 1 19:46:15 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 02 Sep 2011 11:46:15 +1000
Subject: [torquedev] Environment variable for PBS script name ?
In-Reply-To: <72d7ebc0-6b9c-408a-bc3b-49f2e858ff3e@mail>
References: <72d7ebc0-6b9c-408a-bc3b-49f2e858ff3e@mail>
Message-ID: <4E603567.1090902@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/09/11 03:06, Ken Nielson wrote:

> Do they want the full path of the script from the
> submission host or the path where the script resides
> on the execution host?

On the submission host please.

For the interim I've suggested they pull the script
name out with qstat -f and then combine that with
$PBS_O_WORKDIR..

cheers,
Chris
- -- 
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5gNWcACgkQO2KABBYQAh8ULACghFNYDZfdrTgpBW6mxFhN1Ug+
XfUAn1rHnPOPw6I/nA9WtYqeVSD1/sXn
=93eF
-----END PGP SIGNATURE-----

From knielson at adaptivecomputing.com Fri Sep 2 06:34:28 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 02 Sep 2011 06:34:28 -0600 (MDT)
Subject: [torquedev] Environment variable for PBS script name ?
In-Reply-To: <4E603567.1090902@unimelb.edu.au>
Message-ID: <0b53ba31-9593-4e58-85e1-01af926e8b5e@mail>

----- Original Message -----
> From: "Christopher Samuel"
> To: torquedev at supercluster.org
> Sent: Thursday, September 1, 2011 7:46:15 PM
> Subject: Re: [torquedev] Environment variable for PBS script name ?
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/09/11 03:06, Ken Nielson wrote:
>
> > Do they want the full path of the script from the
> > submission host or the path where the script resides
> > on the execution host?
>
> On the submission host please.
>
> For the interim I've suggested they pull the script
> name out with qstat -f and then combine that with
> $PBS_O_WORKDIR..
>
> cheers,
> Chris
> - --
> Christopher Samuel - Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.unimelb.edu.au/
>

Chris,

Would you put a request for this in Bugzilla?

Regards

Ken

From knielson at adaptivecomputing.com Fri Sep 2 10:53:55 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 02 Sep 2011 10:53:55 -0600 (MDT)
Subject: [torquedev] TORQUE 2.5.8 released
In-Reply-To: <55d70021-d8fa-4625-b175-2618f4cadb39@mail>
Message-ID:

TORQUE 2.5.8 is now available for general release.

There were no new features added in this release but there were some
notable bug fixes. Among them was a fix for the queue resource procct.
This resource is intended to be internal to TORQUE. The problem
occurred when a job could not be routed to an execution queue and had
to be put in a routing queue. The procct count would be interpreted by
Moab and Maui as a generic resource. Since the procct generic resource
did not exist the job would be stuck. This has been fixed.

Another fix was for NVIDIA gpu mode setting. If exclusive_process or
default were requested as a GPU mode and TORQUE was using a scheduler
the mode would not be changed. This was due to the scheduler stripping
of the mode designation. This problem has now been fixed. However, to
be able to use the default mode will require the addition of a
property to the nodes file of "default" until a new release of Moab is
available. The version is at this time not known.

Please see the CHANGELOG for all bugs fixed in this version of TORQUE.

The release can be downloaded at
http://www.clusterresources.com/downloads/torque/torque-2.5.8.tar.gz

Thanks to everyone who has made this build possible.

Ken Nielson
Adaptive Computing

From bugzilla-daemon at supercluster.org Fri Sep 2 13:40:49 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:40:49 -0600 (MDT)
Subject: [torquedev] [Bug 157] New: Patches for OS X 10.5.8
Message-ID:

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157

           Summary: Patches for OS X 10.5.8
           Product: TORQUE
           Version: 3.0.x
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: nathanweeks at yahoo.com
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

TORQUE 3.0.2 needed a few source-code modifications to compile and run
on OS X 10.5.8; patches are attached. A brief summary of the issues
fixed:

u_threadpool.c:
* All versions of OS X (through 10.7) lack clock_gettime(); this patch
  works around this by detecting if clock_gettime() exists in
  configure.ac, and if not, use gettimeofday() as suggested in the
  OS X 10.6/10.7 manual page for pthread_cond_timedwait()

configure.ac:
* There is a compilation error when linking librt, as OS X doesn't
  have such a library; the suggested workaround to link with librt
  only if clock_gettime() requires it (which it doesn't on some
  platforms, e.g., FreeBSD). This is in addition to the
  previously-mentioned check for the existence of clock_gettime().

pbs_mkdirs.in:
* "echo" is a bash builtin, and when bash is invoked as "/bin/sh", the
  "-e" option loses its special meaning, causing an "-e" to appear at
  the beginning of the first line in the default server_priv/nodes.

-- 
Nathan Weeks
IT Specialist
USDA-ARS
http://www.public.iastate.edu/~weeks
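
For readers who want to see the shape of the u_threadpool.c workaround
described in Bug 157, a minimal sketch follows. It is not the attached
patch. HAVE_CLOCK_GETTIME is an assumed name for whatever symbol the
configure.ac check defines, and realtime_now() and deadline_after()
are hypothetical helpers invented for illustration. The fallback
branch fills a struct timespec from gettimeofday(), which is the
substitution the OS X manual page for pthread_cond_timedwait() is said
to suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Store the current wall-clock time in *ts. Returns 0 on success, -1 on
 * failure. HAVE_CLOCK_GETTIME is an assumed macro name standing in for
 * the result of the configure-time check; it is not taken from TORQUE.
 */
int realtime_now(struct timespec *ts)
{
    assert(ts != NULL);
#ifdef HAVE_CLOCK_GETTIME
    return clock_gettime(CLOCK_REALTIME, ts);
#else
    /* Fallback for platforms without clock_gettime(), e.g. older OS X. */
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    ts->tv_sec = tv.tv_sec;
    ts->tv_nsec = (long)tv.tv_usec * 1000L; /* microseconds to nanoseconds */
    return 0;
#endif
}

/*
 * Build the absolute deadline that pthread_cond_timedwait() expects:
 * the current time plus a whole number of seconds. Hypothetical helper.
 */
int deadline_after(struct timespec *ts, long seconds)
{
    if (realtime_now(ts) != 0)
        return -1;
    ts->tv_sec += seconds;
    return 0;
}
```

A caller would pass the struct timespec filled in by deadline_after()
as the abstime argument of pthread_cond_timedwait(). Keeping the
#ifdef inside one helper means the rest of the thread pool code does
not need to know which clock call is available.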

From bugzilla-daemon at supercluster.org Fri Sep 2 13:41:36 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:41:36 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194136.82C4E4121400@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157

--- Comment #1 from Nathan Weeks 2011-09-02 13:41:36 MDT ---
Created an attachment (id=94)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=94)
patch

From bugzilla-daemon at supercluster.org Fri Sep 2 13:42:09 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:42:09 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194209.9597B4121401@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157

Nathan Weeks changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Attachment #94|patch                       |pbs_mkdirs.in.patch
        description|                            |

From bugzilla-daemon at supercluster.org Fri Sep 2 13:42:46 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:42:46 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194246.E8A7F4121403@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157

--- Comment #2 from Nathan Weeks 2011-09-02 13:42:46 MDT ---
Created an attachment (id=95)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=95)
u_threadpool.c.patch

From bugzilla-daemon at supercluster.org Fri Sep 2 13:43:25 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 2 Sep 2011 13:43:25 -0600 (MDT)
Subject: [torquedev] [Bug 157] Patches for OS X 10.5.8
In-Reply-To:
References:
Message-ID: <20110902194325.0E50F4121405@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=157

--- Comment #3 from Nathan Weeks 2011-09-02 13:43:24 MDT ---
Created an attachment (id=96)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=96)
configure.ac.patch

From bugzilla-daemon at supercluster.org Sun Sep 4 15:39:41 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Sun, 4 Sep 2011 15:39:41 -0600 (MDT)
Subject: [torquedev] [Bug 158] New: [2.5.8] suse.pbs_mom uses never touched
 file /var/lock/subsys/pbs_mom
Message-ID:

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=158

           Summary: [2.5.8] suse.pbs_mom uses never touched file
                    /var/lock/subsys/pbs_mom
           Product: TORQUE
           Version: 2.5.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: burnus at net-b.de
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

On SUSE Linux, the directory /var/lock/subsys/ does not exist, the
file is never touched (contrary to contrib/init.d/pbs_mom) and I get
the following rpmlint warning:

torque-mom.x86_64: W: subsys-unsupported /etc/init.d/pbs_mom
The init script uses /var/lock/subsys which is not supported by this
distribution.

I think the following patch should be correct:

--- contrib/init.d/suse.pbs_mom.orig 2011-09-04 23:36:11.000000000 +0200
+++ contrib/init.d/suse.pbs_mom 2011-09-04 23:36:16.000000000 +0200
@@ -47,7 +47,7 @@
        rc_status -v
        ;;
   purge)
-       [ -f /var/lock/subsys/pbs_mom ] && $0 stop
+       checkproc -p $PIDFILE $PBS_DAEMON && $0 stop
        echo -n "Starting TORQUE Mom with purge: "
        startproc $PBS_DAEMON -r $DAEMON_ARGS
        rc_status -v

From bugzilla-daemon at supercluster.org Sun Sep 4 15:58:33 2011
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Sun, 4 Sep 2011 15:58:33 -0600 (MDT)
Subject: [torquedev] [Bug 158] [2.5.8] suse.pbs_mom uses never touched file
 /var/lock/subsys/pbs_mom
In-Reply-To:
References:
Message-ID: <20110904215833.CC07767885E@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=158

--- Comment #1 from Tobias Burnus 2011-09-04 15:58:33 MDT ---
While patching files, could you also apply the following patch, which
silences the warning:

torque.x86_64: W: incorrect-fsf-address
/usr/share/doc/packages/torque/contrib/pbstop
The Free Software Foundation address in this file seems to be outdated
or misspelled.

--- contrib/pbstop.orig 2011-09-04 23:52:58.000000000 +0200
+++ contrib/pbstop 2011-09-04 23:56:22.000000000 +0200
@@ -14,7 +14,7 @@
 #
 # You should have received a copy of the GNU General Public License
 # along with this program; if not, write to the Free Software
-# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 #
 # Latest version of this software may be found at:
 # http://www-rcf.usc.edu/~garrick/perl-PBS

From siegert at sfu.ca Wed Sep 7 17:44:31 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Wed, 7 Sep 2011 16:44:31 -0700
Subject: [torquedev] procs resource specification peculiarity
Message-ID: <20110907234431.GA6865@stikine.sfu.ca>

Hi,

I vaguely remember a discussion on this list about whether requests
of the form

-l nodes=x:ppn=y+procs=z

should be allowed and as far as I remember the common response was: no.

However, the code in qsub.c looks really strange to me:

      case 'l':

        l_opt = passet;

        /* a ,procs= in the node spec is illegal.
           Validate the node spec */
        if(strstr(optarg, ",procs="))
          {
          printf("illegal node spec: %s\n", optarg);
          return(-1);
          }

First of all the comment is strange: what has this to do with the
node spec? Instead, this has the bizarre effect that job submissions
like

qsub -l walltime=24:00:00,procs=42 job.pbs

are rejected with an "illegal node spec" error message, whereas the
supposedly equivalent submission

qsub -l procs=42,walltime=24:00:00 job.pbs

is accepted. This looks like a bug to me, correct? At least it is
highly confusing for our users.

It is my understanding that a "," in a -l argument should be totally
equivalent to another -l argument, i.e.,

-l spec1,spec2,spec3

should be equivalent to

-l spec1 -l spec2 -l spec3

correct?

If my recollection is correct and we agreed that

-l nodes=x:ppn=y+procs=z

should be rejected while

-l nodes=x:ppn -l procs=z

should be used instead then even

-l nodes=x:ppn=y,procs=z

should be accepted as well. Instead, the code in qsub.c appears to
accept

-l nodes=x:ppn=y+procs=z

while it rejects any spec that contains ",procs". Just the opposite
of what I thought we had discussed.

Cheers,
Martin

-- 
Martin Siegert
Simon Fraser University

From cwest at vpac.org Wed Sep 7 21:30:57 2011
From: cwest at vpac.org (Craig West)
Date: Thu, 08 Sep 2011 13:30:57 +1000
Subject: [torquedev] TORQUE 2.5.8 released
In-Reply-To:
References:
Message-ID: <4E6836F1.4020508@vpac.org>

Hi Ken,

> Please see the CHANGELOG for all bugs fixed in this version of TORQUE.
>
> The release can be downloaded at http://www.clusterresources.com/downloads/torque/torque-2.5.8.tar.gz

Seems we still need to download the entire package just to read the
CHANGELOG. The change logs were being put in the location below for a
while, can we please continue to do that?

http://www.clusterresources.com/downloads/torque/CHANGELOGS/

Thanks,
Craig.

-- 
Craig West                   Systems Manager
Victorian Partnership for Advanced Computing
110 Victoria Street, Carlton South VIC 3053
P: +61 3 9925 4751      E: cwest at vpac.org
http://www.vpac.org

From knielson at adaptivecomputing.com Thu Sep 8 07:53:41 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 08 Sep 2011 07:53:41 -0600 (MDT)
Subject: [torquedev] TORQUE 2.5.8 released
In-Reply-To: <4E6836F1.4020508@vpac.org>
Message-ID:

----- Original Message -----
> From: "Craig West"
> To: torquedev at supercluster.org
> Sent: Wednesday, September 7, 2011 9:30:57 PM
> Subject: Re: [torquedev] TORQUE 2.5.8 released
>
>
> Hi Ken,
>
> > Please see the CHANGELOG for all bugs fixed in this version of
> > TORQUE.
> >
> > The release can be downloaded at
> > http://www.clusterresources.com/downloads/torque/torque-2.5.8.tar.gz
>
> Seems we still need to download the entire package just to read the
> CHANGELOG. The change logs were being put in the location below for a
> while, can we please continue to do that?
>
> http://www.clusterresources.com/downloads/torque/CHANGELOGS/
>
> Thanks,
> Craig.
>
> --
> Craig West Systems Manager
> Victorian Partnership for Advanced Computing
> 110 Victoria Street, Carlton South VIC 3053
> P: +61 3 9925 4751 E: cwest at vpac.org
> http://www.vpac.org

Craig,

Thanks for pointing that out. Sorry. It will be up shortly.

Ken

From knielson at adaptivecomputing.com Fri Sep 9 16:54:05 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 09 Sep 2011 16:54:05 -0600 (MDT)
Subject: [torquedev] auto tool problem
In-Reply-To:
Message-ID: <9de50321-101c-403c-a96d-785875862567@mail>

Hi all,

We are having a problem with configure.ac and autoconf for the trunk
when building on CentOS 5 and CentOS 3.

Autoconf produces the following error.

configure.ac:49: error: possibly undefined macro: AC_MSG_ERROR
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.

This compiles fine on Ubuntu versions 10 and 11. We have used autoconf
version 2.65 and 2.67. The libtool version is 2.4 and automake is
version 1.11.

On the CentOS we tried the above versions as well as autoconf version
2.59

I also tried this in CentOS 6. It builds successfully.

Thanks for the help.

You can use subversion to check out the trunk at
svn://clusterresources.com/torque/trunk.

Regards

Ken Nielson
Adaptive Computing

From knielson at adaptivecomputing.com Thu Sep 15 12:19:40 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 15 Sep 2011 12:19:40 -0600 (MDT)
Subject: [torquedev] auto tool problem
In-Reply-To: <9de50321-101c-403c-a96d-785875862567@mail>
Message-ID:

----- Original Message -----
> From: "Ken Nielson"
> To: "Torque Developers mailing list"
> Sent: Friday, September 9, 2011 4:54:05 PM
> Subject: [torquedev] auto tool problem
>
> Hi all,
>
> We are having a problem with configure.ac and autoconf for the trunk
> when building on CentOS 5 and CentOS 3.
>
> Autoconf produces the following error.
>
> configure.ac:49: error: possibly undefined macro: AC_MSG_ERROR
> If this token and others are legitimate, please use
> m4_pattern_allow.
> See the Autoconf documentation.
>
> This compiles fine on Ubuntu versions 10 and 11. We have used
> autoconf version 2.65 and 2.67. The libtool version is 2.4 and
> automake is version 1.11.
>
>

Hi all,

It turns out that this is not an OS problem at all. The problem was
the hwloc was not installed on the machines where we were building
TORQUE.

So another possible solution to the many I found out on the web
concerning the autoconf error : possibly undefined macro: issue is
that a library may be missing.

Regards

Ken Nielson

From mej at lbl.gov Thu Sep 15 12:47:47 2011
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 15 Sep 2011 11:47:47 -0700
Subject: [torquedev] auto tool problem
In-Reply-To:
References: <9de50321-101c-403c-a96d-785875862567@mail>
Message-ID: <20110915184744.GZ10643@lbl.gov>

On Thursday, 15 September 2011, at 12:19:40 (-0600),
Ken Nielson wrote:

> It turns out that this is not an OS problem at all. The problem was
> the hwloc was not installed on the machines where we were building
> TORQUE.

Wow. I might've given up before figuring that one out. :) I've been
trying to reproduce the problem here without much success. Glad you
got it resolved!

Michael

-- 
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E        W: 510-495-2687
MS 050C-3396          F: 510-486-8615

From pablo.fernandez at cscs.ch Fri Sep 16 09:35:38 2011
From: pablo.fernandez at cscs.ch (Pablo Fernandez)
Date: Fri, 16 Sep 2011 17:35:38 +0200
Subject: [torquedev] possible bug in qstat -f
Message-ID: <201109161735.38162.pablo.fernandez@cscs.ch>

Dear devels,

We have been fighting with a third-party script that depends on the
output of qstat -f... and it seems the output format has changed
slightly, and the script does not work anymore. We have changed the
script to make it work, but as I mentioned, it is third-party, so no
further updates are possible. Besides, it really seems to me like a
typo somewhere in the pbs client code (2.4.16), so I thought it may be
worth sharing it here.

So, if you type qstat -f and you show all special characters, you get
this:

Job Id: 5304000.lrms02.lcg.cscs.ch$
    Job_Name = cre01_914471456$
    Job_Owner = lhcbprd01 at cream01.lcg.cscs.ch$
    job_state = Q$
    queue = lhcb$
    server = lrms02.lcg.cscs.ch$
    Checkpoint = u$
    ctime = Sat Sep 10 00:10:41 2011$
    Error_Path = cream01.lcg.cscs.ch:/dev/null$
    Output_Path = cream01.lcg.cscs.ch:/dev/null$
    Priority = 0$
[blabla]
    euser = lhcbprd01$
    egroup = lhcb$
    hashname = 5304000.lrms02.lcg.cscs.ch$
    queue_rank = 1300167$
    queue_type = E$
    comment = job rejected by RM 'lrms02' - job started on hostlist wn176.lcg.$
^Icscs.ch at time 15:20:19_09/16,$
^I job reported idle at time 15:23:14_09/16 (see RM logs for details)$
$
    etime = Fri Sep 16 15:50:32 2011$
    submit_args = /tmp/cre01_914471456$
    fault_tolerant = False$
$
Job Id: 5304708.lrms02.lcg.cscs.ch$
(the next job continues...)

So, if you take a look closely, the "comment=" line finishes with two
returns, where all the other fields finish with one. So, this means
any parser that expects a blank line to separate two jobs, will fail.

I guess the comment line was introduced in the 2.4 series, because the
scripts were designed for 2.3, and they don't fail there. Is this
actually a bug that could be fixed within the same series (2.4)?

Thanks a lot!
BR/Pablo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20110916/d52e9527/attachment.html

From samuel at unimelb.edu.au Sun Sep 18 23:18:00 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 19 Sep 2011 15:18:00 +1000
Subject: [torquedev] possible bug in qstat -f
In-Reply-To: <201109161735.38162.pablo.fernandez@cscs.ch>
References: <201109161735.38162.pablo.fernandez@cscs.ch>
Message-ID: <4E76D088.3070708@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 17/09/11 01:35, Pablo Fernandez wrote:

> So, if you take a look closely, the "comment=" line finishes with two
> returns, where all the other fields finish with one. So, this means any
> parser that expects a blank line to separate two jobs, will fail.

I don't see that here (2.4.16), looking at one of our
jobs with a comment set by Moab 6.1.1 in a hex editor
only shows one new line character (0A).

Could it be that your Moab is setting a comment with
an extra new line at the end ?

cheers,
Chris
- -- 
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk520IgACgkQO2KABBYQAh9iOgCgj2CoqN6xZYYihjBVHp3nTI2W
MRIAn0CrE8bhv5DTU24GdwR1WeflyW57
=Xpuu
-----END PGP SIGNATURE-----

From pablo.fernandez at cscs.ch Mon Sep 19 03:19:57 2011
From: pablo.fernandez at cscs.ch (Pablo Fernandez)
Date: Mon, 19 Sep 2011 11:19:57 +0200
Subject: [torquedev] possible bug in qstat -f
In-Reply-To: <4E76D088.3070708@unimelb.edu.au>
References: <201109161735.38162.pablo.fernandez@cscs.ch>
 <4E76D088.3070708@unimelb.edu.au>
Message-ID: <201109191119.57691.pablo.fernandez@cscs.ch>

Hi,

> > So, if you take a look closely, the "comment=" line finishes with two
> > returns, where all the other fields finish with one.
> > So, this means any
> > parser that expects a blank line to separate two jobs, will fail.
>
> I don't see that here (2.4.16), looking at one of our
> jobs with a comment set by Moab 6.1.1 in a hex editor
> only shows one new line character (0A).
>
> Could it be that your Moab is setting a comment with
> an extra new line at the end ?

I guess it could be... I have just upgraded to Moab 6.1.2, and the
comments are still there, with the extra CR, but maybe they are old
comments that torque keeps. I will have to wait for new comments, I
guess.

Thanks!
Pablo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20110919/b103338f/attachment.html

From jayavant.patil82 at gmail.com Mon Sep 19 05:20:28 2011
From: jayavant.patil82 at gmail.com (Jayavant Patil)
Date: Mon, 19 Sep 2011 16:50:28 +0530
Subject: [torquedev] Automatic Job REQUEUE
Message-ID:

Hi,

I am using TORQUE 3.0.0 and Maui 3.3. When the compute node (on which
the job is running) fails due to crash, power failure or any other
reason, how the job which was running on that compute node should get
*automatically * requeued? (I am aware that with *qrerun* we can
manually rerun the job but I don't want this)

Thanks in advance.

-- 
Thanks & Regards,
Jayavant N. Patil
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20110919/30cb1ef1/attachment-0001.html

From s.prabhakaran at grs-sim.de Tue Sep 27 06:14:20 2011
From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran)
Date: Tue, 27 Sep 2011 14:14:20 +0200
Subject: [torquedev] Documentation?
In-Reply-To: <201109191119.57691.pablo.fernandez@cscs.ch>
References: <201109161735.38162.pablo.fernandez@cscs.ch>
 <4E76D088.3070708@unimelb.edu.au>
 <201109191119.57691.pablo.fernandez@cscs.ch>
Message-ID: <2DDA3FD4-AE99-449C-AA94-6BD5163301F9@grs-sim.de>

Hello all,

Are there any design documentation of PBS/Torque (other than the admin
guide) available somewhere? If so, could anyone please point me to
that?
I have been able to find the old PBS (2.2) external reference
specification, internal design specification, and requirements
specification. Haven't been able to find the external design
specification. Is there any documentation available that delves into
design?

Thank you,
Suraj

From l.flis at cyf-kr.edu.pl Tue Sep 27 17:15:34 2011
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Wed, 28 Sep 2011 01:15:34 +0200
Subject: [torquedev] Torque 2.5.8 - memory leaks
Message-ID: <4E825916.2000802@cyf-kr.edu.pl>

Hi,

We are running medium cluster with Torque and Moab,
Average number of jobs is usually around 4k
Number of nodes: 950
Number of cores (at the moment) 10k

We have recently migrated from 2.4.12 to 2.5.8. Unfortunately we are
observing torque memory issues. I'm attaching daily graph from our
ganglia monitoring system. On average it takes 3 hours for torque to
consume 8 Gbytes of memory. Afterwards pbs_server daemon needs to be
restarted as it is unable to perform fork operation due to lack of
available memory, then OOM killer gets in action.

pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork
failed

Due to the scale of the system I am unable to run pbs_server under
valgrind to find the source of leak.
I did some testing on our test cluster but I'm not sure how accurate
the results valgrind provides are:

a lot of messages point to the decode_str function:

==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895==    by 0x4213A2: main (pbsd_main.c:1465)

as well as to decode_arst_direct:

==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895==    by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
==31895==    by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895==    by 0x4213A2: main (pbsd_main.c:1465)

I am going to dig the sources a bit and see if memory allocated by
above functions is freed properly. However any suggestions and hints
will be welcome as I might be unable to fix it all myself.

Thank you for attention
-- 
Lukasz Flis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: leaking-torque.png
Type: image/png
Size: 58057 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20110928/1a3bb9a0/attachment-0001.png

From Gareth.Williams at csiro.au Wed Sep 28 16:00:13 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Thu, 29 Sep 2011 08:00:13 +1000
Subject: [torquedev] Documentation?
In-Reply-To: <2DDA3FD4-AE99-449C-AA94-6BD5163301F9@grs-sim.de>
References: <201109161735.38162.pablo.fernandez@cscs.ch> <4E76D088.3070708@unimelb.edu.au> <201109191119.57691.pablo.fernandez@cscs.ch> <2DDA3FD4-AE99-449C-AA94-6BD5163301F9@grs-sim.de>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE4F@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: torquedev-bounces at supercluster.org [mailto:torquedev-bounces at supercluster.org] On Behalf Of Suraj Prabhakaran
> Sent: Tuesday, 27 September 2011 10:14 PM
> To: Torque Developers mailing list
> Subject: [torquedev] Documentation?
>
> Hello all,
>
> Are there any design documentation of PBS/Torque (other than the admin
> guide) available somewhere? If so, could anyone please point me to
> that?
> I have been able to find the old PBS (2.2) external reference
> specification, internal design specification, and requirements
> specification. Haven't been able to find the external design
> specification. Is there any documentation available that delves into
> design?
>
> Thank you,
> Suraj
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev

Hi Suraj,

These posts have (broken) links to diagrams of job states. I recall they were good, though they may not be what you want.
http://www.supercluster.org/pipermail/torqueusers/2008-January/006706.html
http://www.supercluster.org/pipermail/torqueusers/2007-June/005724.html

Gareth

From knielson at adaptivecomputing.com Wed Sep 28 16:00:32 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 28 Sep 2011 16:00:32 -0600 (MDT)
Subject: [torquedev] Torque 2.5.8 - memory leaks
In-Reply-To: <4E825916.2000802@cyf-kr.edu.pl>
Message-ID: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>

----- Original Message -----
> From: "Lukasz Flis"
> To: "Torque Developers mailing list"
> Sent: Tuesday, September 27, 2011 5:15:34 PM
> Subject: [torquedev] Torque 2.5.8 - memory leaks
>
> Hi,
>
> We are running medium cluster with Torque and Moab,
> Average number of jobs is usually around 4k
> Number of nodes: 950
> Number of cores (at the moment) 10k
>
> We have recently migrated from 2.4.12 to 2.5.8. Unfortunately we are
> observing torque memory issues. I'm attaching daily graph from our
> ganglia monitoring system. On average it takes 3 hours for torque to
> consume 8 Gbytes of memory. Afterwards pbs_server daemon needs to be
> restarted as it is unable to perform fork operation due to lack of
> available memory, then OOM killer gets in action.
>
> pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed
>
> Due to the scale of the system I am unable to run pbs_server under
> valgrind to find the source of leak.
> I did some testing on our test
> cluster but i'm not sure how accurate results valgrind provides:
>
> a lot of messages point to decode_str function:
> ==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
> ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> ==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
> ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
>
> so as to decode_arst_direct:
>
> ==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
> ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> ==31895==    by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
> ==31895==    by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
> ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
>
>
> I am going to dig the sources a bit and see if memory allocated by above
> functions is freed properly.
> However any suggestions and hints will be welcome as I might be unable
> to fix it all myself.
>
> Thank you for attention
> --
> Lukasz Flis
>

What scheduler are you using?

Ken Nielson

From toth at fi.muni.cz Wed Sep 28 16:27:46 2011
From: toth at fi.muni.cz ("Mgr. Šimon Tóth")
Date: Thu, 29 Sep 2011 00:27:46 +0200
Subject: [torquedev] Torque 2.5.8 - memory leaks
In-Reply-To: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
References: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
Message-ID: <4E839F62.1060901@fi.muni.cz>

> What scheduler are you using?

How is that relevant?

--
Mgr.
Simon Toth

From l.flis at cyf-kr.edu.pl Wed Sep 28 16:29:15 2011
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Thu, 29 Sep 2011 00:29:15 +0200
Subject: [torquedev] Torque 2.5.8 - memory leaks
In-Reply-To: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
References: <10234caf-bf9b-47c1-9ad9-e75ad00b9052@mail>
Message-ID: <201109290029.15266.l.flis@cyf-kr.edu.pl>

Hello Ken,

On Thursday 29 September 2011 00:00:32 Ken Nielson wrote:
> ----- Original Message -----
>
> > From: "Lukasz Flis"
> > To: "Torque Developers mailing list"
> > Sent: Tuesday, September 27, 2011 5:15:34 PM
> > Subject: [torquedev] Torque 2.5.8 - memory leaks
> >
> > Hi,
> >
> > We are running medium cluster with Torque and Moab,
> > Average number of jobs is usually around 4k
> > Number of nodes: 950
> > Number of cores (at the moment) 10k
> >
> > We have recently migrated from 2.4.12 to 2.5.8. Unfortunately we are
> > observing torque memory issues. I'm attaching daily graph from our
> > ganglia monitoring system. On average it takes 3 hours for torque to
> > consume 8 Gbytes of memory. Afterwards pbs_server daemon needs to be
> > restarted as it is unable to perform fork operation due to lack of
> > available memory, then OOM killer gets in action.
> >
> > pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed
> >
> > Due to the scale of the system I am unable to run pbs_server under
> > valgrind to find the source of leak.
> > I did some testing on our test
> > cluster but i'm not sure how accurate results valgrind provides:
> >
> > a lot of messages point to decode_str function:
> > ==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
> > ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> > ==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
> > ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> > ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> > ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> > ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
> >
> > so as to decode_arst_direct:
> >
> > ==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
> > ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> > ==31895==    by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
> > ==31895==    by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
> > ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> > ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> > ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> > ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
> >
> >
> > I am going to dig the sources a bit and see if memory allocated by above
> > functions is freed properly.
> > However any suggestions and hints will be welcome as I might be unable
> > to fix it all myself.
> >
> > Thank you for attention
> > --
> > Lukasz Flis
>
> What scheduler are you using?

We are using MOAB 6.1.1. Server scheduling is set to True. We had similar problems with 2.4.12 and Maui, but the amount of consumed/leaked memory was smaller because Torque was a 32-bit build.

> Ken Nielson
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
>
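
When a valgrind log contains many "definitely lost" records like the two quoted in this thread, it can help to total the lost bytes per calling function before reading the sources. The following is only a rough sketch: the function name lost_bytes_by_caller, the ALLOCATORS set and the regular expressions are illustrative assumptions, not part of Torque or valgrind, and the sample text is just the two records from the original report (as produced by a run with valgrind --leak-check=full).

```python
import re
from collections import Counter

# Summary line of a leak record, e.g.
#   ==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
#   ==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost ...
RECORD = re.compile(
    r"==\d+==\s+([\d,]+) (?:\([^)]*\) )?bytes in [\d,]+ blocks are definitely lost"
)
# Stack frame line, e.g.
#   ==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
FRAME = re.compile(r"==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (\S+)")

# Frames to skip when looking for the "interesting" caller (assumed list).
ALLOCATORS = {"malloc", "calloc", "realloc", "strdup"}


def lost_bytes_by_caller(log_text):
    """Sum 'definitely lost' bytes per first non-allocator stack frame."""
    totals = Counter()
    pending = None  # bytes of the record whose caller we have not seen yet
    for line in log_text.splitlines():
        record = RECORD.search(line)
        if record:
            pending = int(record.group(1).replace(",", ""))
            continue
        frame = FRAME.search(line)
        if frame and pending is not None and frame.group(1) not in ALLOCATORS:
            totals[frame.group(1)] += pending
            pending = None
    return totals


# The two loss records from the original report.
SAMPLE = """\
==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895==    by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
==31895==    by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
"""

if __name__ == "__main__":
    for caller, lost in lost_bytes_by_caller(SAMPLE).most_common():
        print(f"{caller}: {lost} bytes definitely lost")
```

On the sample this attributes 41 bytes to decode_str and 143 bytes to decode_arst_direct. Run over a complete log, the same tally would show whether these two call sites account for a meaningful share of the growth or whether the bulk of it is lost elsewhere.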