From samuel at unimelb.edu.au Sun Apr 1 19:27:08 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 02 Apr 2012 11:27:08 +1000 Subject: [torquedev] Torque SVN server down ? Message-ID: <4F79006C.7040207@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 For a week now (since last Monday at least) the Torque SVN server has seemed to be down. When I try and update my local git clone I get: samuel at eris:~/Downloads/Torque$ git svn rebase Connection timed out: Can't connect to host 'svn.clusterresources.com': Connection timed out at /usr/lib/git-core/git-svn line 2322 cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk95AGsACgkQO2KABBYQAh+Z0QCdEcuPvqaLa10RZK02/dUbmWql vogAnjN7u7UZC3SV8LrjAk06offRGFDf =tBBX -----END PGP SIGNATURE----- From knielson at adaptivecomputing.com Mon Apr 2 11:24:32 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 2 Apr 2012 11:24:32 -0600 Subject: [torquedev] Torque SVN server down ? In-Reply-To: <4F79006C.7040207@unimelb.edu.au> References: <4F79006C.7040207@unimelb.edu.au> Message-ID: Chris, Just fixed the problem. It was a firewall issue. Ken On Sun, Apr 1, 2012 at 7:27 PM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > For a week now (since last Monday at least) the Torque SVN > server has seemed to be down. 
When I try and update my local > git clone I get: > > samuel at eris:~/Downloads/Torque$ git svn rebase > Connection timed out: Can't connect to host > 'svn.clusterresources.com': Connection timed out at > /usr/lib/git-core/git-svn line 2322 > > > cheers, > Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk95AGsACgkQO2KABBYQAh+Z0QCdEcuPvqaLa10RZK02/dUbmWql > vogAnjN7u7UZC3SV8LrjAk06offRGFDf > =tBBX > -----END PGP SIGNATURE----- > _______________________________________________ > torquedev mailing list > torquedev at supercluster.org > http://www.supercluster.org/mailman/listinfo/torquedev > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torquedev/attachments/20120402/8d1cdeb8/attachment.html From samuel at unimelb.edu.au Mon Apr 2 19:49:34 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 03 Apr 2012 11:49:34 +1000 Subject: [torquedev] Torque SVN server down ? In-Reply-To: References: <4F79006C.7040207@unimelb.edu.au> Message-ID: <4F7A572E.6090303@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/04/12 03:24, Ken Nielson wrote: > Just fixed the problem. It was a firewall issue. Thanks! Can confirm it's working from here. 
All the best, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk96Vy4ACgkQO2KABBYQAh8jGgCfbg/TNXr8OGmafrSWjR3qI31D zpIAmwZvwJA+bjBSFo1STNbmz/ZJbk/j =wf5u -----END PGP SIGNATURE----- From knielson at adaptivecomputing.com Tue Apr 3 15:43:17 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 3 Apr 2012 15:43:17 -0600 Subject: [torquedev] TORQUE 4.0.1 release candidate available Message-ID: Hi all, There is a TORQUE 4.0.1 release candidate available for download. There are several bug fixes and code refactoring over the 4.0.0 release. Please see the CHANGELOG for a list of bug fixes and enhancements. The code can be downloaded from http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/ torque-4.0.1-snap.201204031514.tar.gz Please download this and let us know what problems you find. Regards Ken Nielson -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20120403/c9beaf68/attachment.html From dbeer at adaptivecomputing.com Tue Apr 3 17:04:41 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 3 Apr 2012 17:04:41 -0600 Subject: [torquedev] [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0912@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC08CD@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC08E3@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0912@ORSMSX106.amr.corp.intel.com> Message-ID: We apologize for this problem - it appears that make distcheck wasn't run before this was sent out. Please check the new snapshot here: http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-4.0.1-snap.201204031702.tar.gz Thanks Steve for diving into this and finding the problem. 
David

On Tue, Apr 3, 2012 at 4:43 PM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:

> FYI, The svn 4.0.1 source tree does seem to build into the required rpms
> just fine.
>
> I am currently in the process of updating to those newer rpms now on my
> 264 nodes.
> --
> Steven DuChene
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
> Sent: Tuesday, April 03, 2012 3:13 PM
> To: Torque Users Mailing List; Torque Developers mailing list
> Subject: Re: [torqueusers] TORQUE 4.0.1 release candidate available
>
> I just pulled the svn tree for 4.0.1 and the file is there too so the tar
> ball must be incomplete.
> --
> Steven DuChene
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
> Sent: Tuesday, April 03, 2012 3:07 PM
> To: Torque Users Mailing List; Torque Developers mailing list
> Subject: Re: [torqueusers] TORQUE 4.0.1 release candidate available
>
> I just checked my svn source tree for 4.0-fixes and the file is present
> there.
> --
> Steven DuChene
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
> Sent: Tuesday, April 03, 2012 3:03 PM
> To: Torque Users Mailing List; Torque Developers mailing list
> Subject: Re: [torqueusers] TORQUE 4.0.1 release candidate available
>
> Tried to use the packaged spec file in the snapshot to build rpms but it
> failed. It seems that a referenced include file is missing or the name is
> misspelled.
>
> gcc -DHAVE_CONFIG_H -I. -I../../src/include -I../../src/include
> -I../../src/resmom/linux -DPBS_MOM -DDEMUX=\"/usr/sbin/pbs_demux\"
> -DRCP_PATH=\"/usr/bin/scp\" -DRCP_ARGS=\"-rpB\"
> -DPBS_SERVER_HOME=\"/var/spool/torque\"
> -DPBS_ENVIRON=\"/var/spool/torque/pbs_environment\" -O2 -g -pipe -Wall
> -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> --param=ssp-buffer-size=4 -m64 -mtune=generic -c -o job_recov.o `test -f
> '../server/job_recov.c' || echo './'`../server/job_recov.c
> mom_comm.c:127:42: error: mom_job_func.h: No such file or directory
> mom_process_request.c:43:42: error: mom_job_func.h: No such file or directory
> mom_process_request.c: In function 'close_quejob':
> mom_process_request.c:540: warning: implicit declaration of function 'job_purge'
> catch_child.c:38:42: error: mom_job_func.h: No such file or directory
> mom_comm.c: In function 'im_join_job_as_sister':
> mom_comm.c:2427: warning: implicit declaration of function 'job_purge'
> make[3]: *** [mom_process_request.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
> mom_main.c:81:42: error: mom_job_func.h: No such file or directory
> catch_child.c: In function 'mom_deljob':
> catch_child.c:2008: warning: implicit declaration of function 'job_purge'
> mom_req_quejob.c:106:42: error: mom_job_func.h: No such file or directory
> make[3]: *** [catch_child.o] Error 1
> mom_req_quejob.c: In function 'req_quejob':
> mom_req_quejob.c:353: warning: implicit declaration of function 'job_purge'
> mom_main.c: In function 'rm_request':
> mom_main.c:4496: warning: implicit declaration of function 'job_purge'
> make[3]: *** [mom_req_quejob.o] Error 1
> make[3]: *** [mom_comm.o] Error 1
> make[3]: *** [mom_main.o] Error 1
> make[3]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src/resmom'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src/resmom'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src'
> make: *** [all-recursive] Error 1
> error: Bad exit status from /var/tmp/rpm-tmp.9L8CSg (%build)
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
> Sent: Tuesday, April 03, 2012 2:43 PM
> To: torqueusers; Torque Developers mailing list
> Subject: [torqueusers] TORQUE 4.0.1 release candidate available
>
> Hi all,
>
> There is a TORQUE 4.0.1 release candidate available for download. There
> are several bug fixes and code refactoring over the 4.0.0 release. Please
> see the CHANGELOG for a list of bug fixes and enhancements.
>
> The code can be downloaded from
> http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-4.0.1-snap.201204031514.tar.gz
>
> Please download this and let us know what problems you find.
>
> Regards
>
> Ken Nielson
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20120403/8ed9de38/attachment-0001.html

From dbeer at adaptivecomputing.com Tue Apr 3 17:19:29 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 3 Apr 2012 17:19:29 -0600
Subject: [torquedev] [torqueusers] TORQUE 4.0.1 release candidate available
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC094F@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC094F@ORSMSX106.amr.corp.intel.com>
Message-ID: 

Are you certain you used the correct tarball? I am able to build the new one
and not the old one. Also:

dbeer at napali:/home/dbeer/Downloads/torque-4.0.1# ls src/resmom/| grep mom_job_func.h
mom_job_func.h

David

On Tue, Apr 3, 2012 at 5:14 PM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:

> I also downloaded the snapshot file again on a completely different system
> and the mom_job_func.h file is still not present.
>
> I tried the same build over on that system from that newly downloaded
> snapshot file and it failed in the same spot due to the missing include
> file.
> --
> Steven DuChene
>

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20120403/09348906/attachment.html From dbeer at adaptivecomputing.com Wed Apr 4 08:59:42 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 4 Apr 2012 08:59:42 -0600 Subject: [torquedev] [torqueusers] TORQUE 4.0 and hwloc In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> Message-ID: Steven, I was supposed to add that note and I forgot - my mistake and thanks for catching it. I have now added: *** For admins that use cpusets in any form *** hwloc version 1.1 or greater is now required for building TORQUE with cpusets, as pbs_mom now uses the hwloc API to create the cpusets instead of creating them manually. to README.building_40. As far as checking for the existence of the library, this does happen at configure time once the configure script determines that the user is going to be using cpusets in any way, which a few different configure options can trigger. David On Tue, Apr 3, 2012 at 8:15 PM, DuChene, StevenX A < stevenx.a.duchene at intel.com> wrote: > I installed hwloc-1.4.1 and hwloc-devel-1.4.1 rpms on the server where I > am building torque-4.X and in looking through the output from the configure > script during the build I do not see anywhere that the existence of any > hwloc stuff is checked. 
In fact in grepping through the output from the whole torque rpm build
> process I do not see ANY mention of hwloc at all.
>
> I see compile time flags of HWLOC_CFLAGS and HWLOC_LIBS mentioned in the
> --help output from configure but according to the description text this is
> just supposed to over-ride the pkg-config results however I do not see any
> evidence that the pkg-config system is being quizzed at all for the
> existence of hwloc on the build server.
>
> Is there some step I am missing?
>
> I thought someone mentioned that there would be better documentation of
> the hwloc business in the torque-4.0.1 release?
>
> If so where is it?
> --
> Steven DuChene
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
> Sent: Monday, March 19, 2012 8:54 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced
>
> Steve,
>
> Hwloc is now required for running cpusets in TORQUE, and it helps out a
> lot both in immediate use and in groundwork for future features.
>
> Immediately hwloc gives you a better cpuset because it gives you the next
> core instead of the next indexed core. For example: many eight core systems
> have processors 0, 2, 4, and 6 next to each other and processors 1, 3, 5,
> and 7 next to each other. If you're running a pre-4.0 TORQUE, and you have
> two jobs on the node, each with 4 cores, job 1 will have 0-3 and job 2 will
> have 4-7. In TORQUE 4.0, job 1 will have 0, 2, 4, and 6, and job 2 will
> have 1, 3, 5, and 7. This should help speed up processing times for jobs
> (NOTE: only if you have this kind of system and a comparable job layout,
> I'm not promising a general speed-up to everyone using cpusets). This
> should also allow us to properly handle hyperthreading for anyone that has
> it turned on and wishes to use it.
>
> The last immediate feature is if you have SMT (simultaneous
> multi-threading) hardware. The mom config variable $use_smt was added. By
> default, the use of SMT is enabled, but you can tell your pbs_mom to ignore
> the SMT threads (not place them in the cpuset) by adding
>
> $use_smt false
>
> to your mom config file.
>
> For the future, the hwloc threads make it really easy for us to handle
> hardware specific requests. One of the coming features for TORQUE is to
> allow requests roughly similar to:
>
> socket=2:numa=2 --with-hyperthreads
>
> which would say to spread the job over 2 sockets, and across the 2 numa
> nodes on each socket. This is a feature we plan to add to improve support
> for Magny-Cours and Opteron type processors that have multiple sockets
> and/or multiple numa nodes on the processor chip. Using hwloc makes it so we
> don't have to parse system files and map the indices to the sockets and/or
> numa nodes ourselves; we can simply use easy hwloc functions
> like hwloc_get_next_obj_inside_cpuset_by_type() that allow you to just move
> on to the next physical core or virtual core, or skip to the next socket or
> numa node as the case may be.
>
> David
>
> On Mon, Mar 19, 2012 at 8:47 AM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:
>
> Also a better (more complete) explanation of what features are enabled
> when hwloc is used would be helpful as well.
>
> BTW, I built torque on my server without hwloc installed and then
> installed the resulting mom packages on my nodes. The mom daemons in that
> case did seem to start up just fine.
> --
> Steven DuChene
>
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Craig West
> Sent: Sunday, March 18, 2012 10:40 PM
> To: Torque Users mailing list; Torque Developers mailing list
> Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced
>
>
> Hi Steven,
>
> I have just begun testing Torque 4.0, as hwloc has been a long awaited
> feature for me.
>
> > It is unclear from this announcement text where hwloc has to be installed.
> > Is it just on the server or on the nodes only?
>
> It needs to be available on the BUILD server and the nodes. I tried to
> run pbs_mom on a node without the hwloc I had installed and it failed.
>
> Note: I am running hwloc 1.4 from a directory in /usr/local
> This was not automatically found by the TORQUE configure script, but you
> can specify the location using HWLOC_CFLAGS & HWLOC_LIBS.
> It embeds the locations that you specify in the pbs_mom (and other
> files) but it seems you can set the LD_LIBRARY_PATH variable if it is
> not in the same location on the BUILD server as the compute nodes.
> For simplicity installing them in the same location makes sense.
>
> > More documentation about this would be greatly appreciated.
>
> I agree, clearer and more detailed documentation would be useful.
>
> Cheers,
> Craig.
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> -- 
> David Beer | Software Engineer
> Adaptive Computing
>

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20120404/5179d804/attachment-0001.html

From knielson at adaptivecomputing.com Thu Apr 5 14:50:09 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 5 Apr 2012 14:50:09 -0600
Subject: [torquedev] New 3.0.5 release candidate available
Message-ID: 

Hi all,

There is a new TORQUE 3.0.5 release candidate available. This candidate has
a fix for yet another case where procct does not get deleted before being
passed to the scheduler, which prevents the job from running. See the
CHANGELOG for all updates.

The code can be downloaded at
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-3.0.5-snap.201204051313.tar.gz

Please download this if you are using 3.0.x and let us know if you find any
problems.

Regards

Ken Nielson
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20120405/c1c2641e/attachment.html

From alan at madllama.net Sun Apr 15 19:51:10 2012
From: alan at madllama.net (Alan Wild)
Date: Sun, 15 Apr 2012 20:51:10 -0500
Subject: [torquedev] "Fixing" qsig -s USR1 and kill_delay on torque 2.5.x
In-Reply-To: 
References: 
Message-ID: 

I got some more time this evening to compare my patch against the latest
3.0.5 snapshot. It turns out that the 2.x patch applies cleanly against the
3.x tree. I think the last version of the 2.x patch was eaten by the mail
server during the outage so here it is again.

-%<---%<--CUT HERE---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<--
diff -rN -U4 torque-3.0.5-snap.201204051313-old/src/resmom/start_exec.c torque-3.0.5-snap.201204051313/src/resmom/start_exec.c
--- torque-3.0.5-snap.201204051313-old/src/resmom/start_exec.c 2012-04-05 14:13:16.000000000 -0500
+++ torque-3.0.5-snap.201204051313/src/resmom/start_exec.c 2012-04-15 20:42:00.000000000 -0500
@@ -2048,8 +2048,13 @@
   if (TJE->is_interactive == FALSE)
     {
     int k;

+    if (strlen(buf)+5 <= MAXPATHLEN) {
+      memmove(buf+5,buf,strlen(buf)+1);
+      strncpy(buf, "exec ", 5);
+    }
+
     /* pass name of shell script on pipe */
     /* will be stdin of shell */

     close(TJE->pipe_script[0]);
@@ -3732,9 +3737,9 @@
     {
     arg[aindex] = malloc(
       strlen(path_jobs) +
       strlen(pjob->ji_qs.ji_fileprefix) +
-      strlen(JOB_SCRIPT_SUFFIX) + 1);
+      strlen(JOB_SCRIPT_SUFFIX) + 6);

     if (arg[aindex] == NULL)
       {
       log_err(errno,id,"cannot alloc env");
@@ -3745,9 +3750,10 @@

       return(-1);
       }

-    strcpy(arg[aindex], path_jobs);
+    strcpy(arg[aindex], "exec ");
+    strcat(arg[aindex], path_jobs);
     strcat(arg[aindex], pjob->ji_qs.ji_fileprefix);
     strcat(arg[aindex], JOB_SCRIPT_SUFFIX);

     arg[aindex + 1] = NULL;
-%<---%<--CUT HERE---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<--

-Alan

On Wed, Mar 28, 2012 at 2:47 PM, Alan Wild wrote:

> We still don't have permission to install torque-4.0.0,
even on our test
> systems. However, I thought I would take a look at the source for pbs_mom
> to see how it works. It appears, overall, very similar to 2.5.11. So I
> attempted to port my patch to 4.0.0.
>
> This does compile, but I can't comment on whether or not it will run. :)
>
>
> -%<---%<--CUT HERE---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<--
> diff -rN -U4 torque-4.0.0/src/resmom/start_exec.c torque-4.0.0-new/src/resmom/start_exec.c
> --- torque-4.0.0/src/resmom/start_exec.c 2012-02-21 17:43:51.000000000 -0600
> +++ torque-4.0.0-new/src/resmom/start_exec.c 2012-03-28 14:34:37.000000000 -0500
> @@ -2213,8 +2213,13 @@
>    if (TJE->is_interactive == FALSE)
>      {
>      int k;
>
> +    if (strlen(buf)+5 <= MAXPATHLEN) {
> +      memmove(buf+5,buf,strlen(buf)+1);
> +      strncpy(buf, "exec ", 5);
> +    }
> +
>      /* pass name of shell script on pipe */
>      /* will be stdin of shell */
>
>      close(TJE->pipe_script[0]);
> @@ -3881,9 +3886,9 @@
>      {
>      arg[aindex] = calloc(1,
>        strlen(path_jobs) +
>        strlen(pjob->ji_qs.ji_fileprefix) +
> -      strlen(JOB_SCRIPT_SUFFIX) + 1);
> +      strlen(JOB_SCRIPT_SUFFIX) + 6);
>
>      if (arg[aindex] == NULL)
>        {
>        log_err(errno,id,"cannot alloc env");
> @@ -3894,9 +3899,10 @@
>
>        return(-1);
>        }
>
> -    strcpy(arg[aindex], path_jobs);
> +    strcpy(arg[aindex], "exec ");
> +    strcat(arg[aindex], path_jobs);
>      strcat(arg[aindex], pjob->ji_qs.ji_fileprefix);
>      strcat(arg[aindex], JOB_SCRIPT_SUFFIX);
>
>      arg[aindex + 1] = NULL;
> -%<---%<--CUT HERE---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<--
>
>
> -Alan
>
>
> On Mon, Mar 26, 2012 at 11:15 PM, Alan Wild wrote:
>
>
>> Sorry, real world intruded the last couple of weeks and I haven't had a
>> chance to dive back into this. Yes, for users building Torque with
>> SHELL_USE_ARGV == 1, you would need to modify TMomFinalizeChild().
>> However, we don't build Torque this way, so I haven't had a chance to
>> really test this.
Regardless, I took a stab at making the patch more
>> complete.
>>
>> This is against the released 2.5.11:
>>
>>
>> diff -rN -U2 torque-2.5.11/src/resmom/start_exec.c torque-2.5.11-new/src/resmom/start_exec.c
>> --- torque-2.5.11/src/resmom/start_exec.c 2012-03-08 15:34:57.000000000 -0600
>> +++ torque-2.5.11-new/src/resmom/start_exec.c 2012-03-26 23:03:56.000000000 -0500
>> @@ -1997,4 +1997,9 @@
>>      int k;
>>
>> +    if (strlen(buf)+5 <= MAXPATHLEN) {
>> +      memmove(buf+5,buf,strlen(buf)+1);
>> +      strncpy(buf, "exec ", 5);
>> +    }
>> +
>>      /* pass name of shell script on pipe */
>>      /* will be stdin of shell */
>> @@ -3641,5 +3646,5 @@
>>        strlen(path_jobs) +
>>        strlen(pjob->ji_qs.ji_fileprefix) +
>> -      strlen(JOB_SCRIPT_SUFFIX) + 1);
>> +      strlen(JOB_SCRIPT_SUFFIX) + 6);
>>
>>      if (arg[aindex] == NULL)
>> @@ -3654,5 +3659,6 @@
>>        }
>>
>> -    strcpy(arg[aindex], path_jobs);
>> +    strcpy(arg[aindex], "exec ");
>> +    strcat(arg[aindex], path_jobs);
>>      strcat(arg[aindex], pjob->ji_qs.ji_fileprefix);
>>      strcat(arg[aindex], JOB_SCRIPT_SUFFIX);
>>
>> I would love to know if anyone other than me has played with this patch
>> and whether or not it's looking viable.
>>
>> -Alan
>>
>>
>> On Mon, Mar 19, 2012 at 4:52 AM,
>> wrote:
>> >
>> > I definitely agree that exec'ing the script is the correct way to
>> > spawn it. I think the patch is reasonable.
>> >
>> > Would "exec " also need to be added to the shell command line in
>> > TMomFinalizeChild() in the SHELL_USE_ARGV == 1 case?
>> >
>> > Michael
>>
>> --
>> alan at madllama.net http://humbleville.blogspot.com
>>
>
>
> --
> alan at madllama.net http://humbleville.blogspot.com
>

-- 
alan at madllama.net http://humbleville.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torquedev/attachments/20120415/77e9d272/attachment.html

From bugzilla-daemon at supercluster.org Thu Apr 19 13:36:21 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Thu, 19 Apr 2012 13:36:21 -0600 (MDT)
Subject: [torquedev] [Bug 175] New: buffer overruns in cpuset.c
Message-ID: 

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=175

           Summary: buffer overruns in cpuset.c
           Product: TORQUE
           Version: 3.0.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: blocker
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: siegert at sfu.ca
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


I am trying to get torque 3.0.4 working in a UV1000 system with 2048 cores.
There are buffer overruns in cpuset.c, add_cpus_to_jobset:
Both cpusbuf and memsbuf are allocated with a fixed length of MAXPATHLEN+1,
however, cpusbuf holds the list (comma separated) of cpus (cores), i.e.,
its length must be (# of digits in Ncores + 1) * Ncores, where Ncores is
the number of cores in the system, e.g., for Ncores = 2048 the required
length is 10240; MAXPATHLEN is 1024.
The solution is something like:

char *cpusbuf;
int len_Ncores = 2, cnt;
for (cnt = Ncores; cnt > 9; cnt /= 10) len_Ncores++;
cpusbuf = (char *)malloc(len_Ncores*Ncores*sizeof(char));

Similarly for memsbuf.
My problem right now is how do I get Ncores (and similarly Nmems) in
add_cpus_to_jobset? For now I can hardcode the length of 10240, but that is
going to break as soon as somebody tries to run on a machine with more than
2048 cores.

- Martin

-- 
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
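
The sizing rule described in Bug 175 can be sketched in plain C. This is only an illustration under stated assumptions, not the fix that went into TORQUE: cpulist_bufsize() and build_cpulist() are hypothetical helper names that do not exist in cpuset.c, the core count is assumed to be known to the caller already (the open question in the report), and the list is assumed to be the worst case, every index from 0 to Ncores-1.

```c
#include <stdio.h>
#include <stdlib.h>

/* Upper bound, in bytes, for a comma-separated list of the indices
 * 0..ncores-1, including the terminating NUL. Each entry needs at most
 * digits(ncores) characters plus one separator (the last entry uses that
 * extra byte for the NUL instead). Same arithmetic as in the bug report. */
static size_t cpulist_bufsize(unsigned int ncores)
{
  size_t width = 2;          /* one digit + one separator */
  unsigned int cnt;

  for (cnt = ncores; cnt > 9; cnt /= 10)
    width++;

  return width * (size_t)ncores;
}

/* Hypothetical helper (not part of TORQUE): returns a malloc'ed string
 * "0,1,...,ncores-1", or NULL on error. The caller frees the result. */
static char *build_cpulist(unsigned int ncores)
{
  size_t size = cpulist_bufsize(ncores);
  size_t used = 0;
  unsigned int i;
  char *buf;

  if (ncores == 0)
    return NULL;

  buf = malloc(size);
  if (buf == NULL)
    return NULL;

  buf[0] = '\0';

  for (i = 0; i < ncores; i++)
    {
    /* snprintf never writes past the remaining space; a return value
     * that does not fit means the bound was wrong, so give up. */
    int n = snprintf(buf + used, size - used, "%s%u", (i == 0) ? "" : ",", i);

    if (n < 0 || (size_t)n >= size - used)
      {
      free(buf);
      return NULL;
      }

    used += (size_t)n;
    }

  return buf;
}
```

For 2048 cores the bound comes out at 10240 bytes, the figure quoted in the report. A growable string, as suggested in the follow-up comment on the bug, avoids having to know the core count up front.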
From bugzilla-daemon at supercluster.org Thu Apr 19 16:19:47 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Thu, 19 Apr 2012 16:19:47 -0600 (MDT) Subject: [torquedev] [Bug 175] buffer overruns in cpuset.c In-Reply-To: References: Message-ID: <20120419221947.A1D3A3EA81B0@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=175 dbeer at adaptivecomputing.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dbeer at adaptivecomputing.com --- Comment #1 from dbeer at adaptivecomputing.com 2012-04-19 16:19:47 MDT --- I would use the dynamic_string struct instead of trying to calculate these numbers from beforehand. It is fairly easy to use, just check out dynamic_string.h. -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Mon Apr 23 15:52:06 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 15:52:06 -0600 (MDT) Subject: [torquedev] [Bug 176] New: torque-4.0.0: HWLOC_CFLAGS and HWLOC_LIBS ignored Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=176 Summary: torque-4.0.0: HWLOC_CFLAGS and HWLOC_LIBS ignored Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P5 Component: libtorque AssignedTo: dbeer at adaptivecomputing.com ReportedBy: siegert at sfu.ca CC: torquedev at supercluster.org Estimated Hours: 0.0 Created an attachment (id=103) --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=103) patch to include HWLOC env. variables in Libifl Makefile When configuring torque-4.0.0 with HWLOC_CFLAGS and HWLOC_LIBS set the subsequent make fails in Libifl because those environment variables are not used in the Makefile: gcc -DHAVE_CONFIG_H -I. 
-I../../../src/include -I../../../src/include -DPBS_DEFAULT_FILE=\"/var/spool/torque/server_name\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -g -DGEOMETRY_REQUESTS -DALWAYS_USE_CPUSETS -DNUMA_SUPPORT -MT tcp_dis.o -MD -MP -MF .deps/tcp_dis.Tpo -c -o tcp_dis.o tcp_dis.c In file included from ../Libutils/u_lock_ctl.h:5, from tcp_dis.c:101: ../../../src/include/pbs_nodes.h:98:21: error: hwloc.h: No such file or directory In file included from ../Libutils/u_lock_ctl.h:5, from tcp_dis.c:101: ../../../src/include/pbs_nodes.h:225: error: expected specifier-qualifier-list before 'hwloc_bitmap_t' make[4]: *** [tcp_dis.o] Error 1 make[4]: Leaving directory `/usr/local/src/torque-4.0.0/src/lib/Libifl' The attached patch works for me. - Martin -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Mon Apr 23 18:49:22 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 18:49:22 -0600 (MDT) Subject: [torquedev] [Bug 177] New: Init script for pbs_server can't stop the daemon Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=177 Summary: Init script for pbs_server can't stop the daemon Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 The init script for pbs_server in torque 4.x has the following to stop the daemon: killproc pbs_server -TERM which doesn't seem to work. This does: killall -QUIT pbs_server This has the effect that issuing the following command: sudo /etc/init.d/pbs_server restart gives this output: Shutting down TORQUE Server: [ OK ] pbs_server is already running. which is quite confusing.
-- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Mon Apr 23 18:52:58 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 18:52:58 -0600 (MDT) Subject: [torquedev] [Bug 178] New: Array jobs should be allowed to have single job dependencies. Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=178 Summary: Array jobs should be allowed to have single job dependencies. Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 At present, torque 4.x (at least) requires that array jobs have array dependencies, which seems an artificial restriction. We commonly use torque for map/reduce type operations, where you may have a large array of jobs to preprocess some data, then a single followup job which gathers the data together and applies some algorithm to it. -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
From bugzilla-daemon at supercluster.org Mon Apr 23 19:22:52 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 19:22:52 -0600 (MDT) Subject: [torquedev] [Bug 179] New: autogen.sh tries to use incorrect m4 directory Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=179 Summary: autogen.sh tries to use incorrect m4 directory Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 On RHEL 5, autogen.sh from the 4.0.1 branch with revision r6023, creates a configure script with the wrong m4 include directory: [rhys at moby torque-4.0.1]$ ./autogen.sh Putting files in AC_CONFIG_AUX_DIR, `buildutils'. aclocal:configure.ac:116: warning: macro `AM_SILENT_RULES' not found in library aclocal:configure.ac:451: warning: macro `AM_SILENT_RULES' not found in library [rhys at moby torque-4.0.1]$ ./configure checking build system type... x86_64-unknown-linux-gnu checking host system type... x86_64-unknown-linux-gnu configure: error: cannot find macro directory `m4' Changing these two lines in the configure script: case m4 in [\\/]* | ?:[\\/]* ) ac_macro_dir=m4 ;; *) ac_macro_dir=$srcdir/m4 ;; esac to case m4 in [\\/]* | ?:[\\/]* ) ac_macro_dir=buildutils ;; *) ac_macro_dir=$srcdir/buildutils ;; esac fixes the problem. I have automake v1.9.6 and autoconf v2.59 -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
From bugzilla-daemon at supercluster.org Mon Apr 23 19:40:59 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 19:40:59 -0600 (MDT) Subject: [torquedev] [Bug 180] New: configure script has hwloc->pkg-config->check dependency Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=180 Summary: configure script has hwloc->pkg-config->check dependency Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 The configure script in 4.0.1 r6023 seems to have an ordering dependency such that hwloc requires pkg-config to be present, which is fine. But pkg-config requires a program called check to be present, which seems to only be necessary for unit testing? If check is not present, configure fails if you enable cpusets, saying that hwloc is missing. The following work around enables configure to finish: PKG_CONFIG=pkg-config ./configure --enable-cpuset --with-pam=yes --with-default-server=moby.cs.adelaide.edu.au -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
From bugzilla-daemon at supercluster.org Mon Apr 23 22:22:53 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 22:22:53 -0600 (MDT) Subject: [torquedev] [Bug 181] New: Deadlock in pbsd_init_reque Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=181 Summary: Deadlock in pbsd_init_reque Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 pbsd_init_reque currently causes a deadlock on error in torque 4.0.1 r6023. The code looks like this: ---------------------------------------- pthread_mutex_lock(server.sv_qs_mutex); if (svr_enquejob(pjob, TRUE, -1) == PBSE_NONE) { ... Went OK } else { ... Had an error job_abt(&pjob, logbuf); /* NOTE: pjob freed but dangling pointer remains */ } pthread_mutex_unlock(server.sv_qs_mutex); ---------------------------------------- However, the calls within job_abt eventually try to lock sv_qs_mutex again while it is still held, which deadlocks. This version is OK: ---------------------------------------- pthread_mutex_lock(server.sv_qs_mutex); if (svr_enquejob(pjob, TRUE, -1) == PBSE_NONE) { strcat(logbuf, msg_init_queued); strcat(logbuf, pjob->ji_qs.ji_queue); log_event( PBSEVENT_SYSTEM | PBSEVENT_ADMIN | PBSEVENT_DEBUG, PBS_EVENTCLASS_JOB, pjob->ji_qs.ji_jobid, logbuf); pthread_mutex_unlock(server.sv_qs_mutex); } else { /* Oops, this should never happen */ sprintf(logbuf, "%s; job %s queue %s", msg_err_noqueue, pjob->ji_qs.ji_jobid, pjob->ji_qs.ji_queue); log_err(-1, "pbsd_init_reque", logbuf); pthread_mutex_unlock(server.sv_qs_mutex); job_abt(&pjob, logbuf); /* NOTE: pjob freed but dangling pointer remains */ } return; } /* END pbsd_init_reque() */ ---------------------------------------- i.e. the unlock call is moved into both branches of the if statement, so that the mutex is released before job_abt is called.
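The lock ordering problem described above can be reproduced in isolation. The sketch below is not TORQUE code: queue_mutex stands in for server.sv_qs_mutex, abort_job is a hypothetical stand-in for job_abt, and the mutex is deliberately created as an error-checking mutex (an assumption made only for the demonstration) so that the second lock attempt reports EDEADLK instead of hanging.

```c
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK under strict ISO C modes */
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t queue_mutex;  /* stands in for server.sv_qs_mutex */
static int relock_error;             /* result of the lock attempt inside abort_job() */

/* Hypothetical stand-in for job_abt(): somewhere down its call chain it
 * takes queue_mutex itself. */
static void abort_job(void)
{
    relock_error = pthread_mutex_lock(&queue_mutex);
    if (relock_error == 0)
        pthread_mutex_unlock(&queue_mutex);
}

/* Create queue_mutex as an error-checking mutex so a relock by the
 * owning thread fails with EDEADLK rather than blocking forever. */
int init_queue_mutex(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(&queue_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Ordering from the bug report: abort while the lock is still held. */
int requeue_broken(int enqueue_failed)
{
    relock_error = 0;
    pthread_mutex_lock(&queue_mutex);
    if (enqueue_failed)
        abort_job();                 /* tries to lock queue_mutex again */
    pthread_mutex_unlock(&queue_mutex);
    return relock_error;
}

/* Ordering from the proposed fix: unlock in both branches, then abort. */
int requeue_fixed(int enqueue_failed)
{
    relock_error = 0;
    pthread_mutex_lock(&queue_mutex);
    if (!enqueue_failed) {
        pthread_mutex_unlock(&queue_mutex);
    } else {
        pthread_mutex_unlock(&queue_mutex);
        abort_job();                 /* lock is free, so this succeeds */
    }
    return relock_error;
}
```

With a default (non error-checking) mutex the broken ordering would normally block forever on the error path, which matches the deadlock reported here.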
-- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Mon Apr 23 23:40:23 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 23:40:23 -0600 (MDT) Subject: [torquedev] [Bug 182] New: serverdb is saved out with missing data Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=182 Summary: serverdb is saved out with missing data Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 The server DB is corrupted on save in 4.0.1 r6023, but the following patch fixes it. The bug is that attr_to_str calls size_to_dynamic_string assuming it will append to the string, when it currently replaces it. It looked to me like it was supposed to append, so the following shortens and corrects size_to_dynamic_string. I'm not 100% convinced this is doing the right thing: my server DB currently says that we've got 67645734912kb of memory allocated, which is ~63TB. If that were bytes, it'd be about right. I wonder if the code also needs a case for szv.atsv_shift=0 or similar?
Index: src/lib/Libutils/u_dynamic_string.c =================================================================== --- src/lib/Libutils/u_dynamic_string.c (revision 6023) +++ src/lib/Libutils/u_dynamic_string.c (working copy) @@ -155,55 +155,36 @@ { char buffer[MAXLINE]; - int add_one = FALSE; - - if (ds->used == 0) - add_one = TRUE; - + size_t str_size = 0; + + // We insert any old suffix to start with and fix it below: sprintf(buffer, "%lukb", szv.atsv_num); - resize_if_needed(ds, buffer); - - sprintf(buffer, "%lu", szv.atsv_num); - strcat(ds->str, buffer); + str_size = strlen( buffer ); switch (szv.atsv_shift) { case 10: - - strcat(ds->str, "kb"); - + buffer[str_size - 2] = 'k'; break; case 20: - - strcat(ds->str, "mb"); - + buffer[str_size - 2] = 'm'; break; case 30: - - strcat(ds->str, "gb"); - + buffer[str_size - 2] = 'g'; break; case 40: - - strcat(ds->str, "tb"); - + buffer[str_size - 2] = 't'; break; case 50: - - strcat(ds->str, "pb"); - + buffer[str_size - 2] = 'p'; break; } - - ds->used += strlen(buffer) + 2; - if (add_one == TRUE) - ds->used += 1; - - return(PBSE_NONE); + + return append_dynamic_string( ds, buffer ); } /* END size_to_dynamic_string() */ -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
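As a rough illustration of the suffix selection discussed in this report, the sketch below formats a size value into a caller-supplied buffer. It is not the patched TORQUE function: struct size_value and format_size are hypothetical names, the two fields only mirror atsv_num and atsv_shift, and treating a shift of 0 as plain bytes is an assumption prompted by the question at the end of the report, not something the source confirms. Unlike the patch, which leaves the "kb" suffix in place for any unrecognised shift, this sketch rejects unknown shifts.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical mirror of the two fields the patch reads. */
struct size_value {
    unsigned long num;    /* magnitude, like atsv_num */
    unsigned int  shift;  /* power of two of the unit, like atsv_shift */
};

/* Writes e.g. "64gb" into out.
 * Returns 0 on success, -1 if the shift is unknown or out is too small. */
int format_size(const struct size_value *sv, char *out, size_t outlen)
{
    const char *suffix;
    int n;

    switch (sv->shift) {
    case 0:  suffix = "b";  break;   /* assumption: shift 0 means bytes */
    case 10: suffix = "kb"; break;
    case 20: suffix = "mb"; break;
    case 30: suffix = "gb"; break;
    case 40: suffix = "tb"; break;
    case 50: suffix = "pb"; break;
    default: return -1;
    }

    n = snprintf(out, outlen, "%lu%s", sv->num, suffix);
    if (n < 0 || (size_t)n >= outlen)
        return -1;                   /* output would have been truncated */
    return 0;
}
```

Building the whole string first and appending it in a single step is the same idea as the patch, which hands the finished buffer to append_dynamic_string.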
From bugzilla-daemon at supercluster.org Mon Apr 23 23:43:44 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 23:43:44 -0600 (MDT) Subject: [torquedev] [Bug 183] New: Minor bug in attr_to_str Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=183 Summary: Minor bug in attr_to_str Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 rc isn't always updated properly, this seems necessary: Index: src/server/svr_recov.c =================================================================== --- src/server/svr_recov.c (revision 6023) +++ src/server/svr_recov.c (working copy) @@ -465,7 +465,7 @@ current = (resource *)GET_NEXT(current->rs_link); } - append_dynamic_string(ds, "\n"); + rc = append_dynamic_string(ds, "\n"); } not a big deal, since in practice rc is always PBSE_NONE -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Mon Apr 23 23:44:33 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 23:44:33 -0600 (MDT) Subject: [torquedev] [Bug 183] Minor bug in attr_to_str In-Reply-To: References: Message-ID: <20120424054433.DD046412235B@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=183 rhys.hill at adelaide.edu.au changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|enhancement |normal -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
From bugzilla-daemon at supercluster.org Mon Apr 23 23:44:53 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 23:44:53 -0600 (MDT) Subject: [torquedev] [Bug 182] serverdb is saved out with missing data In-Reply-To: References: Message-ID: <20120424054453.E0A5A412236A@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=182 rhys.hill at adelaide.edu.au changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|enhancement |major -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Mon Apr 23 23:45:13 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Mon, 23 Apr 2012 23:45:13 -0600 (MDT) Subject: [torquedev] [Bug 180] configure script has hwloc->pkg-config->check dependency In-Reply-To: References: Message-ID: <20120424054513.802604122393@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=180 rhys.hill at adelaide.edu.au changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|enhancement |normal -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
From bugzilla-daemon at supercluster.org Tue Apr 24 00:21:28 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Tue, 24 Apr 2012 00:21:28 -0600 (MDT) Subject: [torquedev] [Bug 184] New: pbs_server fails after you've added 12 job arrays Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=184 Summary: pbs_server fails after you've added 12 job arrays Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: critical Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 There's a documentation and then code bug which means that if you add 12 job arrays to pbs_server, it prints out this message: 04/24/2012 15:41:23;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot allocate memory (12) in insert_array, No memory to resize the array...SYSTEM FAILURE After this, job arrays are broken until pbs_server is restarted, then you get phantom job arrays left in the queue. This fixes it (Note that ENOMEM==12): Index: src/lib/Libutils/u_resizable_array.c =================================================================== --- src/lib/Libutils/u_resizable_array.c (revision 6023) +++ src/lib/Libutils/u_resizable_array.c (working copy) @@ -186,7 +186,8 @@ /* * inserts an item, resizing the array if necessary * - * @return the index in the array or ENOMEM + * @return the index in the array or -1 on failure, + * which indicates ENOMEM. 
*/ int insert_thing( Index: src/server/array_func.c =================================================================== --- src/server/array_func.c (revision 6023) +++ src/server/array_func.c (working copy) @@ -1879,7 +1879,7 @@ pthread_mutex_lock(allarrays.allarrays_mutex); - if ((rc = insert_thing(allarrays.ra,pa)) == ENOMEM) + if ((rc = insert_thing(allarrays.ra,pa)) == -1) { log_err(rc,id,"No memory to resize the array...SYSTEM FAILURE\n"); } -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Tue Apr 24 18:10:09 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Tue, 24 Apr 2012 18:10:09 -0600 (MDT) Subject: [torquedev] [Bug 175] buffer overruns in cpuset.c In-Reply-To: References: Message-ID: <20120425001009.3338A4122CE7@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=175 --- Comment #2 from Martin Siegert 2012-04-24 18:10:08 MDT --- Created an attachment (id=104) --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=104) fix buffer overruns in cpuset.c -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Tue Apr 24 18:11:05 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Tue, 24 Apr 2012 18:11:05 -0600 (MDT) Subject: [torquedev] [Bug 175] buffer overruns in cpuset.c In-Reply-To: References: Message-ID: <20120425001105.D6AA84122CE7@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=175 --- Comment #3 from Martin Siegert 2012-04-24 18:11:05 MDT --- The problem is more complicated: apparently there exists a 4095 byte limit for the kernel VFS. 
uv1000:/dev/cpuset/torque/martin # seq -s, 1 1041 | wc 1 1 4098 uv1000:/dev/cpuset/torque/martin # seq -s, 1 1040 > cpus ; cat cpus 1-1040 uv1000:/dev/cpuset/torque/martin # seq -s, 1 1041 > cpus ; cat cpus 1 uv1000:/dev/cpuset/torque/martin # echo "1-1041" > cpus ; cat cpus 1-1041 Thus, it is not possible to simply increase the string size of cpusbuf, at least not beyond 4095, which is good for a maximum of only 1040 cpus. The solution is indicated above: it is possible to write ranges to the cpus file, instead of writing a comma-separated list of cpus. The attached patch implements such a solution: first sort the list of cpus, then construct a cpusbuf string that collapses the list of cpus into ranges as much as possible. -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Wed Apr 25 07:42:54 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Wed, 25 Apr 2012 07:42:54 -0600 (MDT) Subject: [torquedev] [Bug 185] New: Can't delete job arrays with dead jobs, such arrays should never be loaded Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=185 Summary: Can't delete job arrays with dead jobs, such arrays should never be loaded Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 Created an attachment (id=105) --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=105) An example of a job array which cannot be deleted, due to a corrupt/phantom job. It's possible to have corrupt job arrays where they contain jobs that no longer exist.
The effect of this is that the array becomes impossible to delete, because currently job arrays are deleted via reference counted garbage collection during purging of the jobs in the array. An attempt to delete such an array currently results in no errors, but the array is not removed. This patch: Index: src/server/array_func.c =================================================================== --- src/server/array_func.c (revision 6023) +++ src/server/array_func.c (working copy) @@ -1272,6 +1272,7 @@ { int i; int num_skipped = 0; + int num_jobs = 0; job *pjob; @@ -1287,6 +1288,7 @@ } else { + num_jobs++; if (pjob->ji_qs.ji_state >= JOB_STATE_EXITING) { /* invalid state for request, skip */ @@ -1303,7 +1305,12 @@ } } } - + + /* If there were no valid jobs, return -1. */ + if(num_jobs==0){ + return -1; + } + return(num_skipped); } /* END delete_whole_array() */ Index: src/server/req_deletearray.c =================================================================== --- src/server/req_deletearray.c (revision 6023) +++ src/server/req_deletearray.c (working copy) @@ -305,10 +305,14 @@ log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, log_buf); } + if (num_skipped == -1) { + /* the array had no jobs within it, delete it. */ + log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, "Found array with no jobs - deleting structures."); + array_delete(pa); + } else if (num_skipped != 0) + { /* some jobs were not deleted. They must have been running or had JOB_SUBSTATE_TRANSIT */ - if (num_skipped != 0) - { ptask = set_task(WORK_Timed, time_now + 10, array_delete_wt, preq, FALSE); if (ptask) explicitly deletes the array object if it is found to contain no jobs or no valid jobs. The code which resurrects arrays from disk should also be improved to delete such jobs rather than requeuing them, but I'll leave that to the experts! :) With this patch, the arrays can at least be removed. I've also attached an example bad job.
It appears the job is stuck in an exiting state, and never moves on. -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Wed Apr 25 17:31:38 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Wed, 25 Apr 2012 17:31:38 -0600 (MDT) Subject: [torquedev] [Bug 186] New: torque-4.0.0: routing to routing queue does not work Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=186 Summary: torque-4.0.0: routing to routing queue does not work Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: blocker Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: siegert at sfu.ca CC: torquedev at supercluster.org Estimated Hours: 0.0 Our torque setup involves a default queue, which is a routing queue. One of its destinations is a routing queue as well: create queue default set queue default queue_type = Route set queue default route_destinations = atlas set queue default route_destinations += dev create queue dev set queue dev queue_type = Route set queue dev route_destinations = q1 set queue dev route_destinations += qs set queue dev route_destinations += ql (q1, qs, ql are execution queues). When I submit a job, it always ends up in the dev queue, it never gets delivered into one of the execution queues. As if routing of jobs is tried once only (from default to dev). The setup works without problems under torque-2.5.11. As a consequence I have not been able to test torque-4.0.0 at all; I cannot get a single job to run. -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
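For reference, the range-collapsing approach described in comment #3 on Bug 175 earlier in this archive (sort the cpu list, then merge consecutive ids into ranges) might look roughly like the sketch below. This is not the attached patch, which is not reproduced in the archive; cpus_to_ranges and cmp_int are hypothetical names, and skipping duplicate ids is an added assumption.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparator for ints, written to avoid overflow on subtraction. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    return (x > y) - (x < y);
}

/* Sorts cpus[] in place and writes a range list such as "0-3,8,10-11"
 * into out. Returns the string length, or -1 if out is too small. */
int cpus_to_ranges(int *cpus, size_t count, char *out, size_t outlen)
{
    size_t i = 0;
    size_t used = 0;

    if (outlen == 0)
        return -1;
    out[0] = '\0';
    if (count > 0)
        qsort(cpus, count, sizeof(cpus[0]), cmp_int);

    while (i < count) {
        int first = cpus[i];
        int last = first;
        int n;

        /* extend the run across consecutive (or duplicate) ids */
        while (i + 1 < count && cpus[i + 1] <= last + 1) {
            last = cpus[i + 1];
            i++;
        }
        i++;

        if (first == last)
            n = snprintf(out + used, outlen - used, "%s%d",
                         used ? "," : "", first);
        else
            n = snprintf(out + used, outlen - used, "%s%d-%d",
                         used ? "," : "", first, last);

        if (n < 0 || (size_t)n >= outlen - used)
            return -1;               /* would not fit */
        used += (size_t)n;
    }
    return (int)used;
}
```

A contiguous allocation of 1041 cpus then becomes the six-character string "1-1041", far below the 4095-byte limit observed in that comment, although a heavily fragmented cpu list could in principle still exceed it.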
From bugzilla-daemon at supercluster.org Thu Apr 26 00:04:17 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Thu, 26 Apr 2012 00:04:17 -0600 (MDT) Subject: [torquedev] [Bug 187] New: segfault in job_abt after dealing with array dependencies Message-ID: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=187 Summary: segfault in job_abt after dealing with array dependencies Product: TORQUE Version: 3.0.x Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P5 Component: pbs_server AssignedTo: dbeer at adaptivecomputing.com ReportedBy: rhys.hill at adelaide.edu.au CC: torquedev at supercluster.org Estimated Hours: 0.0 This code in job_abt: if (pjob->ji_wattr[JOB_ATR_depend].at_flags & ATR_VFLAG_SET) { strcpy(jobid, pjob->ji_qs.ji_jobid); depend_on_term(pjob); pjob = find_job(jobid); } /* update internal array bookeeping values */ if ((pjob->ji_arraystruct != NULL) && (pjob->ji_is_array_template == FALSE)) { ... } is causing a seg fault for us, in torque 4.0.1, r6023, since find_job is changing pjob to be null, then the following conditional statement crashes. Strangely, the code within the conditional statement has several checks for pjob being null, while the condition itself does not. This patch: Index: src/server/job_func.c =================================================================== --- src/server/job_func.c (revision 6023) +++ src/server/job_func.c (working copy) @@ -526,10 +526,14 @@ strcpy(jobid, pjob->ji_qs.ji_jobid); depend_on_term(pjob); pjob = find_job(jobid); + if (pjob == NULL){ + log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, jobid, "lost job after setting up dependencies."); } + } /* update internal array bookeeping values */ - if ((pjob->ji_arraystruct != NULL) && + if ((pjob != NULL) && + (pjob->ji_arraystruct != NULL) && (pjob->ji_is_array_template == FALSE)) { job_array *pa = get_jobs_array(&pjob); resolves the crash. 
It seems like the code was intended to have this check, but it was lost/missed/deleted. -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Thu Apr 26 00:09:42 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Thu, 26 Apr 2012 00:09:42 -0600 (MDT) Subject: [torquedev] [Bug 160] job dependent upon array stuck in UserHold state In-Reply-To: References: Message-ID: <20120426060942.4AF26412150B@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=160 rhys.hill at adelaide.edu.au changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |rhys.hill at adelaide.edu.au Depends on| |178 --- Comment #1 from rhys.hill at adelaide.edu.au 2012-04-26 00:09:42 MDT --- This works for me with torque 4.0.1 and the patch from Bug 178. I'm testing with: ID1=`qsub -t 1-2 ./stage1.sh` ID2=`qsub -W depend=afterokarray:${ID1} ./stage2.sh` ID3=`qsub -t 1-2 -W depend=afterok:${ID2} ./stage3.sh` -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Thu Apr 26 00:09:42 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Thu, 26 Apr 2012 00:09:42 -0600 (MDT) Subject: [torquedev] [Bug 178] Array jobs should be allowed to have single job dependencies. 
In-Reply-To: References: Message-ID: <20120426060942.697FD412150C@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=178 rhys.hill at adelaide.edu.au changed: What |Removed |Added ---------------------------------------------------------------------------- Blocks| |160 -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Thu Apr 26 13:21:56 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Thu, 26 Apr 2012 13:21:56 -0600 (MDT) Subject: [torquedev] [Bug 186] torque-4.0.0: routing to routing queue does not work In-Reply-To: References: Message-ID: <20120426192156.3FB5641217C9@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=186 dbeer at adaptivecomputing.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dbeer at adaptivecomputing.com --- Comment #1 from dbeer at adaptivecomputing.com 2012-04-26 13:21:56 MDT --- Martin, I'm looking into a fix for this. To allow you to continue testing - could you submit directly to the dev queue for now? David -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
From bugzilla-daemon at supercluster.org Fri Apr 27 01:30:28 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Fri, 27 Apr 2012 01:30:28 -0600 (MDT) Subject: [torquedev] [Bug 179] autogen.sh tries to use incorrect m4 directory In-Reply-To: References: Message-ID: <20120427073028.D419B4122F40@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=179 rhys.hill at adelaide.edu.au changed: What |Removed |Added ---------------------------------------------------------------------------- Blocks| |180 --- Comment #1 from rhys.hill at adelaide.edu.au 2012-04-27 01:30:28 MDT --- The patch below fixes configure.ac so that it works with automake-1.9, which is all that's available on RHEL 5. It also resolves bug 180, by putting an explicit check for pkg-config at the start, rather than waiting for the first pkg check to occur, as was the case previously. Index: configure.ac =================================================================== --- configure.ac (revision 6044) +++ configure.ac (working copy) @@ -37,8 +37,11 @@ AC_CANONICAL_HOST AC_CONFIG_MACRO_DIR([m4]) -LT_INIT +dnl Enable libtool +AC_PROG_LIBTOOL +dnl Ensure we have pkg-config +PKG_PROG_PKG_CONFIG AC_CHECK_PROGS(MAKE,$MAKE make gmake,error) if test "x$MAKE" = "xerror" ;then @@ -111,10 +114,9 @@ ]) AM_CONDITIONAL(HAVE_CHECK, test x"$have_check" = "xyes") -m4_ifdef([AM_SILENT_RULES],[ - if test "x$have_check" = "xyes"; then - AM_SILENT_RULES(no) - AC_CONFIG_FILES(src/cmds/test/Makefile +AS_IF([test "x$have_check" = "xyes"],[ + AC_CONFIG_FILES( + src/cmds/test/Makefile src/cmds/test/MXML/Makefile src/cmds/test/common_cmds/Makefile src/cmds/test/pbs_track/Makefile @@ -446,12 +448,19 @@ src/tools/test/printjob/Makefile src/tools/test/printserverdb/Makefile src/tools/test/printtracking/Makefile - src/tools/test/tracejob/Makefile) - else - AM_SILENT_RULES(no) - fi + src/tools/test/tracejob/Makefile + ) ]) +dnl +dnl 
###################################################################### +dnl If unit-testing, disable silent rules. Set a variable so that old +dnl automake doesn't create an invalid configure script +AS_IF([test "x$have_check" = "xyes"],[ + m4_ifdef([AM_SILENT_RULES],[AM_SILENT_RULES([no])],[DUMMY_VARIABLE_FOR_OLD_AUTOMAKE=1]) +],[ + m4_ifdef([AM_SILENT_RULES],[AM_SILENT_RULES([yes])],[DUMMY_VARIABLE_FOR_OLD_AUTOMAKE=1]) +]) dnl dnl ###################################################################### -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at supercluster.org Fri Apr 27 01:30:28 2012 From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org) Date: Fri, 27 Apr 2012 01:30:28 -0600 (MDT) Subject: [torquedev] [Bug 180] configure script has hwloc->pkg-config->check dependency In-Reply-To: References: Message-ID: <20120427073028.E9D054122F42@http.supercluster.org> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=180 rhys.hill at adelaide.edu.au changed: What |Removed |Added ---------------------------------------------------------------------------- Depends on| |179 -- Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. 
From bugzilla-daemon at supercluster.org Fri Apr 27 01:32:32 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 27 Apr 2012 01:32:32 -0600 (MDT)
Subject: [torquedev] [Bug 179] autogen.sh tries to use incorrect m4 directory
In-Reply-To:
References:
Message-ID: <20120427073232.884604122F42@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=179

--- Comment #2 from rhys.hill at adelaide.edu.au 2012-04-27 01:32:32 MDT ---
Note that the changes to the script around AM_SILENT_RULES remove the warning
in the first comment.

From bugzilla-daemon at supercluster.org Fri Apr 27 21:25:11 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Fri, 27 Apr 2012 21:25:11 -0600 (MDT)
Subject: [torquedev] [Bug 173] [torque-3.0.4] pbs_mom buffer overflow / segfaults when using --enable-nvidia-gpus [with BUG FIX]
In-Reply-To:
References:
Message-ID: <20120428032511.D161967818E@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=173

--- Comment #3 from Nicolas Pinto 2012-04-27 21:25:11 MDT ---
Any update on this front? Is 4.0.0 vulnerable to this bug?
From bugzilla-daemon at supercluster.org Sun Apr 29 05:21:17 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Sun, 29 Apr 2012 05:21:17 -0600 (MDT)
Subject: [torquedev] [Bug 188] New: job log deadlock
Message-ID:

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=188

           Summary: job log deadlock
           Product: TORQUE
           Version: 3.0.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: critical
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: rhys.hill at adelaide.edu.au
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

There is currently a deadlock that commonly occurs when job logging is enabled.
The deadlock occurs because the function mk_job_log_name locks job_log_mutex to
update the time when the log was opened, even though the lock is already taken
every time its only caller, job_log_open, is executed. The problem is fixed by
simply removing the lock:

Index: src/lib/Liblog/pbs_log.c
===================================================================
--- src/lib/Liblog/pbs_log.c (revision 6023)
+++ src/lib/Liblog/pbs_log.c (working copy)
@@ -272,9 +272,7 @@
 ptm->tm_mday);
 }
-pthread_mutex_lock(job_log_mutex);
 joblog_open_day = ptm->tm_yday; /* Julian date log opened */
-pthread_mutex_unlock(job_log_mutex);
 return(pbuf);
 } /* END mk_job_log_name() */

The structure of the code then matches mk_log_name.
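The relock described in the Bug 188 report can be sketched in isolation. The following is only an illustration, not Torque code: every demo_* name is hypothetical, and the mutex is created as PTHREAD_MUTEX_ERRORCHECK purely so that a second lock by the owning thread returns EDEADLK instead of hanging the way a default mutex typically would.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for job_log_mutex and joblog_open_day. */
static pthread_mutex_t demo_log_mutex;
static int demo_open_day = -1;

/* Create the mutex as error-checking, so a relock by the owning thread
 * reports EDEADLK instead of blocking forever. Returns 0 on success. */
int demo_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(&demo_log_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Pre-fix shape: the helper takes the lock itself. */
int demo_mk_name_locking(int yday)
{
    int rc = pthread_mutex_lock(&demo_log_mutex);
    if (rc != 0)
        return rc;              /* EDEADLK when the caller already holds it */
    demo_open_day = yday;
    pthread_mutex_unlock(&demo_log_mutex);
    return 0;
}

/* Post-fix shape: the helper relies on the lock its caller already holds. */
int demo_mk_name_unlocked(int yday)
{
    demo_open_day = yday;
    return 0;
}

/* A caller that holds the lock while the helper runs, as the report
 * describes for job_log_open. Returns the helper's result. */
int demo_open(int (*mk_name)(int), int yday)
{
    int rc;
    assert(mk_name != NULL);
    rc = pthread_mutex_lock(&demo_log_mutex);
    if (rc != 0)
        return rc;
    rc = mk_name(yday);
    pthread_mutex_unlock(&demo_log_mutex);
    return rc;
}
```

After demo_init(), calling demo_open(demo_mk_name_locking, 118) should report EDEADLK and leave demo_open_day untouched, while demo_open(demo_mk_name_unlocked, 118) should succeed. On some systems the program needs to be built with -pthread.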
From bugzilla-daemon at supercluster.org Mon Apr 30 06:49:41 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Mon, 30 Apr 2012 06:49:41 -0600 (MDT)
Subject: [torquedev] [Bug 189] New: Dangling pointer in job_func.c
Message-ID:

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=189

           Summary: Dangling pointer in job_func.c
           Product: TORQUE
           Version: 3.0.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: rhys.hill at adelaide.edu.au
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

In job_func.c, there are two places where job_purge is called, like so:

    job_purge(pjob);
    *pjobp = NULL;

where pjob = *pjobp. However, this pattern leaves pjob pointing at a dead
structure, and means that later NULL tests on the job pointer are incorrect.
Simply doing this:

    job_purge(pjob);
    *pjobp = NULL;
    pjob = NULL;

resolves the problem.

From bugzilla-daemon at supercluster.org Mon Apr 30 09:55:06 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Mon, 30 Apr 2012 09:55:06 -0600 (MDT)
Subject: [torquedev] [Bug 173] [torque-3.0.4] pbs_mom buffer overflow / segfaults when using --enable-nvidia-gpus [with BUG FIX]
In-Reply-To:
References:
Message-ID: <20120430155506.48B3F4122D56@http.supercluster.org>

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=173

dbeer at adaptivecomputing.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dbeer at adaptivecomputing.com

--- Comment #4 from dbeer at adaptivecomputing.com 2012-04-30 09:55:06 MDT ---
This hasn't been re-written to use the dynamic string struct yet, but it should
happen soon.
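The pattern reported in Bug 189 (a local alias left pointing at a purged job) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in job_func.c: struct demo_job, demo_job_new, demo_job_purge and demo_finish are all hypothetical names, and demo_job_purge simply frees the structure.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the server's job structure. */
struct demo_job {
    int id;
};

/* Hypothetical allocator used only by this sketch. */
struct demo_job *demo_job_new(int id)
{
    struct demo_job *pjob = malloc(sizeof *pjob);
    if (pjob != NULL)
        pjob->id = id;
    return pjob;
}

/* Hypothetical stand-in for job_purge(): releases the job. */
static void demo_job_purge(struct demo_job *pjob)
{
    free(pjob);
}

/* Optionally purges the job, then runs a later NULL test on the local
 * alias. Returns 1 if the job was still alive and was updated, else 0. */
int demo_finish(struct demo_job **pjobp, int purge)
{
    struct demo_job *pjob;

    assert(pjobp != NULL);
    pjob = *pjobp;

    if (purge) {
        demo_job_purge(pjob);
        *pjobp = NULL;  /* clears the caller's handle */
        pjob = NULL;    /* the extra line proposed in the report */
    }

    /* Without "pjob = NULL" above, this test would pass on freed memory. */
    if (pjob != NULL) {
        pjob->id += 1;
        return 1;
    }
    return 0;
}
```

With the alias cleared, demo_finish(&job, 1) takes the NULL branch and never touches the freed structure; with purge set to 0 the job is updated as usual.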