From lloyd_brown at byu.edu Tue Sep 4 10:24:53 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Tue, 04 Sep 2012 10:24:53 -0600
Subject: [torqueusers] TORQUE 4.1.1 available
In-Reply-To:
References:
Message-ID: <50462B55.7040708@byu.edu>

Ken,

I can't help but notice that the "CHANGELOGS" link on the torque
download page doesn't have anything more recent than 4.0.2. Also, the
"Release Notes" for v4.1.1 and v4.1.0 both seem to point to the 4.1.0
documents. You might want to look at this.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 08/30/2012 02:34 PM, Ken Nielson wrote:
> TORQUE version 4.1.1 is now available for general download.
>
> There were several bugs fixed in this version of TORQUE. Several
> deadlock issues were fixed around the combination of job arrays and
> routing queues.
> x11-forwarding was fixed for interactive jobs.
> There were fixes for memory corruption and double free.
> There were 5 memory leaks that were fixed.
> The mail feature we re-enabled. It had been removed in earlier versions
> of TORQUE 4.x
>
> For a complete list of fixes see the CHANGELOG.
>
> We want to thank The University of Michigan, NOAA, University of
> Florida, LBNL and Cray for their help in finding and fixing many of the
> bugs for this release. We also appreciate the contributions made by
> others to the code base.
>
> The tar ball for this release can be downloaded at the following URL.
> http://www.adaptivecomputing.com/support/download-center/torque-download/torque-4.1.1.tar.gz
>
> Thanks again for all of the help. The feedback from the community is
> what makes TORQUE the best it can be.
>
> Regards
>
> Ken Nielson
> Adaptive Computing

From damianmontaldo at gmail.com Tue Sep 4 10:43:41 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Tue, 4 Sep 2012 13:43:41 -0300
Subject: [torqueusers] How to determine the Torque version
Message-ID:

I'm trying to figure out which official version of TORQUE I get when I
install the stable Debian package of torque, version
2.4.8+dfsg-9squeeze1:

$ apt-cache show torque-server | grep Version
Version: 2.4.8+dfsg-9squeeze1

http://packages.debian.org/squeeze/torque-server

According to the troubleshooting section and FAQ
http://www.clusterresources.com/torquedocs21/11.1troubleshooting.shtml#version
I need to run "print server" in qmgr to get the TORQUE version, but
no version is defined there.

Could anybody help me?
Thanks.

From gus at ldeo.columbia.edu Tue Sep 4 11:44:08 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 04 Sep 2012 13:44:08 -0400
Subject: [torqueusers] How to determine the Torque version
In-Reply-To:
References:
Message-ID: <50463DE8.5080806@ldeo.columbia.edu>

Have you tried:

qstat --version

or

qsub --version

?

IHIH
Gus Correa

From damianmontaldo at gmail.com Tue Sep 4 12:17:49 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Tue, 4 Sep 2012 15:17:49 -0300
Subject: [torqueusers] How to determine the Torque version
In-Reply-To: <50463DE8.5080806@ldeo.columbia.edu>
References: <50463DE8.5080806@ldeo.columbia.edu>
Message-ID:

On Tue, Sep 4, 2012 at 2:44 PM, Gus Correa wrote:
> Have you tried:
> qstat --version
> or
> qsub --version
> ?

Hi Gus, thanks for your reply.

$ qsub --version
version: 2.4.8

It seems that the Debian package version is the same as the TORQUE
version. I thought it was wrong because it looked too old.

Thanks again.

From d-ulrick at comcast.net Tue Sep 4 13:24:57 2012
From: d-ulrick at comcast.net (Dave Ulrick)
Date: Tue, 4 Sep 2012 14:24:57 -0500 (CDT)
Subject: [torqueusers] Transforming node names in $PBS_NODEFILE and $PBS_GPUFILE
Message-ID:

Hi,

Our 60-node HPC is configured with two local networks: GigE and
Infiniband. As host names we've defined cnxx (e.g., cn01 or cn60) for
the GigE IPs and icnxx for the IB IPs.
The TORQUE (3.0.4) pbs_server and pbs_moms are configured to use the
GigE host names so $PBS_NODEFILE and $PBS_GPUFILE naturally present
the GigE node names.

I am trying to figure out a way to populate these files with the IB node
names so MPI traffic will use IB instead of GigE. I've already tried to
reconfigure Moab and TORQUE to use the IB nodes but was unsuccessful.
After giving the matter more thought, I'm thinking that my users would
be happiest if they knew that IB bandwidth was being dedicated to their
apps--MPI, NFS, etc.--as opposed to resource manager overhead, so I'd
rather not go that route.

I've advised my users to consider adding code to their PBS scripts to
convert the $PBS_NODEFILE and $PBS_GPUFILE contents as they see fit, but
they'd rather not have to bother.

I've experimented with job-specific prologue and epilogue scripts but
I've not been successful. Both $PBS_NODEFILE and $PBS_GPUFILE are
created with 644 permissions and root ownership so the script can't
write modified files under the same file names. The script could of
course write modified node files under other names, but that wouldn't
let them do anything they couldn't do right in the PBS script itself.

According to the TORQUE admin manual, the system prologue and epilogue
scripts are run as root but with empty environments. If this means that
$PBS_NODEFILE and $PBS_GPUFILE aren't provided to the prologue script, I
won't be able to transform the files there.

Can you think of any way I could convert the node files so that they
will be available via the familiar $PBS_NODEFILE and $PBS_GPUFILE
environment variables, or is my only hope to reconfigure TORQUE and Moab
to use the icnxx node names?
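
For illustration only, the per-job conversion mentioned above (done
inside the PBS script, writing to a new file) might look roughly like
the sketch below. It assumes the cnXX/icnXX naming described in this
message; the function name to_ib_names, the variable IB_NODEFILE and
the output location are hypothetical, and the rewritten copy would have
to be passed to the MPI launcher explicitly because $PBS_NODEFILE
itself is not writable by the user.

```shell
#!/bin/sh
# Sketch only: copy a node file, rewriting GigE names (cnXX) to the
# corresponding IB names (icnXX). to_ib_names and IB_NODEFILE are
# hypothetical names chosen for this example.

to_ib_names() {
    # Put an "i" in front of every line that starts with "cn" + digits.
    # Anything after the host name on the line is left untouched.
    sed 's/^cn\([0-9][0-9]*\)/icn\1/' "$1"
}

# Inside a job, TORQUE sets PBS_NODEFILE; outside a job this is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    IB_NODEFILE="${TMPDIR:-/tmp}/ib_nodefile.$$"
    to_ib_names "$PBS_NODEFILE" > "$IB_NODEFILE"
    echo "IB node file written to $IB_NODEFILE"
fi
```

The same function could be pointed at $PBS_GPUFILE, provided its lines
also begin with the cnXX host name.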

Thanks,
Dave
--
Dave Ulrick
d-ulrick at comcast.net

From lloyd_brown at byu.edu Tue Sep 4 13:37:48 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Tue, 04 Sep 2012 13:37:48 -0600
Subject: [torqueusers] Transforming node names in $PBS_NODEFILE and $PBS_GPUFILE
In-Reply-To:
References:
Message-ID: <5046588C.9000909@byu.edu>

A lot will depend on what MPI implementation you're using. Even if
you're able to transform your nodefiles in some fashion, that would
imply that you're using the IPoIB components for your main job
communication. This means you will have very little latency advantage
over GigE, and significantly reduced IB bandwidth as well (although
probably still better than GigE; just not as good as IB can do). IPoIB
is a good option for when you have no other option, but it's definitely
not as good performance as you can get out of IB.

A much better solution, in my opinion, is to use an MPI implementation
that can speak native IB verbs directly. My personal preference is
OpenMPI, which, if compiled correctly, will find the fastest
communication medium available (IB before GigE). And then, despite
what's in the $PBS_NODEFILE, the job communication will generally go
over that fastest network. Only minimal job setup and status
information is communicated over GigE.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 09/04/2012 01:24 PM, Dave Ulrick wrote:
> I am trying to figure out a way to populate these files with the IB node
> names so MPI traffic will use IB instead of GigE.

From d-ulrick at comcast.net Tue Sep 4 13:55:58 2012
From: d-ulrick at comcast.net (Dave Ulrick)
Date: Tue, 4 Sep 2012 14:55:58 -0500 (CDT)
Subject: [torqueusers] Transforming node names in $PBS_NODEFILE and $PBS_GPUFILE
In-Reply-To: <5046588C.9000909@byu.edu>
References: <5046588C.9000909@byu.edu>
Message-ID:

We have OpenMPI and MVAPICH2 installed on our cluster. It's good to
know that OpenMPI is most likely already doing the right thing. I'll pass
this information along to my users.

Thanks,
Dave
--
Dave Ulrick
d-ulrick at comcast.net

From lloyd_brown at byu.edu Tue Sep 4 14:02:55 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Tue, 04 Sep 2012 14:02:55 -0600
Subject: [torqueusers] Transforming node names in $PBS_NODEFILE and $PBS_GPUFILE
In-Reply-To:
References: <5046588C.9000909@byu.edu>
Message-ID: <50465E6F.4060204@byu.edu>

I can't speak to MVAPICH2, but I have a vague recollection that MVAPICH
wouldn't work unless it was on IB anyway. I could be misremembering,
though.

A good way to tell is to do a bandwidth test (e.g. "osu_bw" from
http://mvapich.cse.ohio-state.edu/benchmarks/), and see what you get.
Generally speaking the bandwidth capabilities are different enough to
make it pretty obvious.
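
As a rough illustration of why the result tends to be obvious: GigE
tops out at about 125 MB/s (1 Gbit/s), so a peak well above that cannot
have gone over GigE. The sketch below pulls the peak out of
osu_bw-style output (comment lines starting with "#", then size and
MB/s columns). The names peak_bw, classify_bw and GIGE_MAX_MBS are made
up for this example, and the 125 MB/s ceiling is a theoretical figure,
not a measured one.

```shell
#!/bin/sh
# Sketch only: compare the peak bandwidth in osu_bw-style output with
# the theoretical GigE ceiling. All names here are hypothetical.
GIGE_MAX_MBS=125

peak_bw() {
    # Ignore "#" comment lines; column 2 is the bandwidth in MB/s.
    awk '!/^#/ && NF >= 2 && $2 + 0 > max { max = $2 + 0 }
         END { print max + 0 }' "$@"
}

classify_bw() {
    peak=$(peak_bw "$@")
    # Let awk do the floating-point comparison.
    if awk -v p="$peak" -v g="$GIGE_MAX_MBS" 'BEGIN { exit !(p > g) }'; then
        echo "peak ${peak} MB/s: faster than GigE can deliver"
    else
        echo "peak ${peak} MB/s: within GigE range"
    fi
}
```

Both functions read a saved output file given as an argument, or
standard input when no file is named.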

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

From samuel at unimelb.edu.au Tue Sep 4 20:15:18 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 05 Sep 2012 12:15:18 +1000
Subject: [torqueusers] Transforming node names in $PBS_NODEFILE and $PBS_GPUFILE
In-Reply-To:
References: <5046588C.9000909@byu.edu>
Message-ID: <5046B5B6.4080901@unimelb.edu.au>

On 05/09/12 05:55, Dave Ulrick wrote:

> We have OpenMPI and MVAPICH2 installed on our cluster. It's good to
> know that OpenMPI is most likely already doing the right thing.

You can check that by looking at a running MPI process and seeing if
it's got any of the IB devices open, like /dev/infiniband/uverbs0.

cheers,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

From samuel at unimelb.edu.au Tue Sep 4 21:05:32 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 05 Sep 2012 13:05:32 +1000
Subject: [torqueusers] torque IRC channel
In-Reply-To: <5022773D.1040507@cyf-kr.edu.pl>
References: <5022773D.1040507@cyf-kr.edu.pl>
Message-ID: <5046C17C.4030401@unimelb.edu.au>

On 09/08/12 00:27, Lukasz Flis wrote:

> Is there any IRC channel for Torque community and/or developers
> available?

I've not used IRC for decades now, but there's nothing to stop those
who do from congregating in #torque if they want to.

cheers,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

From go-yoshimura at sstc.co.jp Wed Sep 5 05:15:58 2012
From: go-yoshimura at sstc.co.jp (Go Yoshimura)
Date: Wed, 05 Sep 2012 20:15:58 +0900
Subject: [torqueusers] How can we cancel a job array with qdel of torque4.1.0?[SOLVED]
Message-ID: <201209051115.AA14079@winxp-pc.sstc.co.jp>

Hi!

- It seems that if we add the -t option, the job name should have [].
- I'm not sure this is a good approach, but
  if we edit torque-4.1.1/src/cmds/qdel.c so that "[]" is added to job_id,
  we can cancel jobs in a job array with qdel.

((qdel.c modification))
[test01@torque02 cmds]$ pwd
/usr/local/src/torque-4.1.1/src/cmds
[test01@torque02 cmds]$ diff -u qdel.c qdel.c.orig
--- qdel.c      2012-09-05 19:23:32.000000000 +0900
+++ qdel.c.orig 2012-08-25 05:42:48.000000000 +0900
@@ -31,7 +31,6 @@
   int any_failed = 0;
   int purge_completed = FALSE;
   int located = FALSE;
-  int tflg = FALSE;
   char *pc;
 
   char job_id[PBS_MAXCLTJOBID]; /* from the command line */
@@ -134,7 +133,6 @@
         snprintf(extend,sizeof(extend),"%s%s",
           ARRAY_RANGE,
           pc);
-        tflg = 1;
 
         break;
 
@@ -211,11 +209,7 @@
     /* check to see if user specified 'all' to delete all jobs */
 
     strcpy(job_id, argv[optind]);
-    /* add [] to job_id if with -t option */
-    if ( tflg != FALSE ) {
-      strcat(job_id, "[]");
-      tflg = FALSE;
-    }
+
     if (get_server(job_id, job_id_out, server_out))
       {
       fprintf(stderr, "qdel: illegally formed job identifier: %s\n",

((qdel -t test))
[test01@torque02 ~]$ qstat -t
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
445[5].torque02           STDIN-5          test01          00:00:00 C batch
445[6].torque02           STDIN-6          test01          00:00:00 C batch
445[7].torque02           STDIN-7          test01          00:00:00 C batch
445[8].torque02           STDIN-8          test01          00:00:00 C batch
446[1].torque02           STDIN-1          test01                 0 R batch
446[2].torque02           STDIN-2          test01                 0 R batch
446[3].torque02           STDIN-3          test01                 0 R batch
446[4].torque02           STDIN-4          test01                 0 R batch
446[5].torque02           STDIN-5          test01                 0 R batch
446[6].torque02           STDIN-6          test01                 0 R batch
446[7].torque02           STDIN-7          test01                 0 R batch
446[8].torque02           STDIN-8          test01                 0 R batch
[test01@torque02 ~]$ /dev/shm/usr/local/torque/bin/qdel -t 5-8 446
[test01@torque02 ~]$ qdel -t 1-4 446
qdel: Unauthorized Request MSG=must have operator or manager privilege to use -m parameter 446.torque02

thank you
go
---

----
Go Yoshimura
Scalable Systems Co., Ltd.
Osaka Office HONMACHI-COLLABO Bldg. 4F, 4-4-2 Kita-kyuhoji-machi, Chuo-ku, Osaka 541-0057 Japan
 Tel: 81-6-6224-4115
Tokyo Kojimachi Office BUREX Kojimachi 11F, 3-5-2 Kojimachi, Chiyoda-ku, Tokyo 102-0083 Japan
 Tel: 81-3-5875-4718 Fax: 81-3-3237-7612

From nt_mahmood at yahoo.com Wed Sep 5 09:50:10 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Wed, 5 Sep 2012 08:50:10 -0700 (PDT)
Subject: [torqueusers] inspecting running jobs
Message-ID: <1346860210.93300.YahooMailNeo@web111701.mail.gq1.yahoo.com>

Dear all,
Assume I have 20 running jobs. When I use the "top" command, I see that
there is one process (which is one of my jobs) that uses a lot of
memory. How can I find which job number it is?

Regards,
Mahmood

From lloyd_brown at byu.edu Wed Sep 5 09:57:32 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Wed, 05 Sep 2012 09:57:32 -0600
Subject: [torqueusers] inspecting running jobs
In-Reply-To: <1346860210.93300.YahooMailNeo@web111701.mail.gq1.yahoo.com>
References: <1346860210.93300.YahooMailNeo@web111701.mail.gq1.yahoo.com>
Message-ID: <5047766C.9040604@byu.edu>

This may not be very automatable, but the best tool (at least in Linux)
in my opinion is "pstree -p". The running process will be a child (or
grandchild, etc.) of a process named something like
"jobnumber.schedulerhostname". So, if you use pstree (with the "-p" to
show process IDs), search for the PID of the hard-hitting process, and
then follow it back up the tree, you should find something.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 09/05/2012 09:50 AM, Mahmood Naderan wrote:
> Dear all,
> Assume I have 20 running jobs. When I use "top" command, I see that
> there is one process (which is one of my jobs) that uses a lot of
> memory. How can I find which job number it is?
>
> Regards,
> Mahmood

From d-ulrick at comcast.net Wed Sep 5 13:35:25 2012
From: d-ulrick at comcast.net (Dave Ulrick)
Date: Wed, 5 Sep 2012 14:35:25 -0500 (CDT)
Subject: [torqueusers] Transforming node names in $PBS_NODEFILE and $PBS_GPUFILE
In-Reply-To: <50465E6F.4060204@byu.edu>
References: <5046588C.9000909@byu.edu> <50465E6F.4060204@byu.edu>
Message-ID:

The results of running 'osu_bw' on two of our nodes certainly suggest IB
is being used:

# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
# Size      Bandwidth (MB/s)
1                       2.42
2                       4.83
4                       9.82
8                      19.22
16                     37.55
32                     71.71
64                    143.92
128                   219.92
256                   420.21
512                   827.68
1024                 1622.58
2048                 2132.29
4096                 2508.80
8192                 2706.56
16384                2799.94
32768                3070.28
65536                3221.56
131072               3307.62
262144               3351.52
524288               3374.24
1048576              3384.84
2097152              3391.51
4194304              3393.63

Dave
--
Dave Ulrick
d-ulrick at comcast.net

From nt_mahmood at yahoo.com Wed Sep 5 13:41:58 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Wed, 5 Sep 2012 12:41:58 -0700 (PDT)
Subject: [torqueusers] inspecting running jobs
In-Reply-To: <5047766C.9040604@byu.edu>
References: <1346860210.93300.YahooMailNeo@web111701.mail.gq1.yahoo.com> <5047766C.9040604@byu.edu>
Message-ID: <1346874118.96024.YahooMailNeo@web111716.mail.gq1.yahoo.com>

With this command, we can find which jobs are running on a particular
node. This is the same as "qstat -rn".
Thanks for your help.

Regards,
Mahmood

From d-ulrick at comcast.net Wed Sep 5 14:04:14 2012
From: d-ulrick at comcast.net (Dave Ulrick)
Date: Wed, 5 Sep 2012 15:04:14 -0500 (CDT)
Subject: [torqueusers] Transforming node names in $PBS_NODEFILE and $PBS_GPUFILE
In-Reply-To: <5046B5B6.4080901@unimelb.edu.au>
References: <5046588C.9000909@byu.edu> <5046B5B6.4080901@unimelb.edu.au>
Message-ID:

On Wed, 5 Sep 2012, Christopher Samuel wrote:

> On 05/09/12 05:55, Dave Ulrick wrote:
>
>> We have OpenMPI and MVAPICH2 installed on our cluster. It's good to
>> know that OpenMPI is most likely already doing the right thing.
>
> You can check that by looking at a running MPI process and seeing if
> it's got any of the IB devices open, like /dev/infiniband/uverbs0.

/dev/infiniband/uverbs0 is indeed open while 'osu_bw' is running. Looks
like I don't have to worry about whether TORQUE uses IB or GigE.

Thanks, everyone! Your suggestions have been very helpful.

Dave
--
Dave Ulrick
d-ulrick at comcast.net

From go-yoshimura at sstc.co.jp Thu Sep 6 07:01:24 2012
From: go-yoshimura at sstc.co.jp (Go Yoshimura)
Date: Thu, 06 Sep 2012 22:01:24 +0900
Subject: [torqueusers] How can we cancel a job array with qdel of torque4.1.0?[SOLVED]
In-Reply-To: <201209051115.AA14079@winxp-pc.sstc.co.jp>
References: <201209051115.AA14079@winxp-pc.sstc.co.jp>
Message-ID: <201209061301.AA14095@winxp-pc.sstc.co.jp>

Hi!

- Sorry to disturb you, but we found a good approach:
    qdel -t 4-8 479[]
- The way to cancel jobs in a job array with qdel is to add "[]" to the job ID.
    (Good) qdel -t 4-8 479[]
    (Bad)  qdel -t 4-8 479

[test01@torque02 ~]$ qstat -t
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
479[1].torque02           NPB01-1          test01                 0 Q batch
479[2].torque02           NPB01-2          test01                 0 Q batch
479[3].torque02           NPB01-3          test01                 0 Q batch
479[4].torque02           NPB01-4          test01                 0 Q batch
479[5].torque02           NPB01-5          test01                 0 Q batch
479[6].torque02           NPB01-6          test01                 0 Q batch
479[7].torque02           NPB01-7          test01                 0 Q batch
479[8].torque02           NPB01-8          test01                 0 Q batch
[test01@torque02 ~]$ qdel -t 4-8 479
qdel: Unauthorized Request MSG=must have operator or manager privilege to use -m parameter 479.torque02.ahoaho
[test01@torque02 ~]$ qdel -t 4-8 479[]
[test01@torque02 ~]$ qstat -t
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
479[1].torque02           NPB01-1          test01                 0 R batch
479[2].torque02           NPB01-2          test01                 0 R batch
479[3].torque02           NPB01-3          test01                 0 Q batch
479[4].torque02           NPB01-4          test01                 0 C batch
479[5].torque02           NPB01-5          test01                 0 C batch
479[6].torque02           NPB01-6          test01                 0 C batch
479[7].torque02           NPB01-7          test01                 0 C batch
479[8].torque02           NPB01-8          test01                 0 C batch

thank you
go
---

----
Go Yoshimura
Scalable Systems Co., Ltd.
Osaka Office HONMACHI-COLLABO Bldg. 4F, 4-4-2 Kita-kyuhoji-machi, Chuo-ku, Osaka 541-0057 Japan
 Tel: 81-6-6224-4115
Tokyo Kojimachi Office BUREX Kojimachi 11F, 3-5-2 Kojimachi, Chiyoda-ku, Tokyo 102-0083 Japan
 Tel: 81-3-5875-4718 Fax: 81-3-3237-7612

From samuel at unimelb.edu.au Mon Sep 10 00:24:15 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 10 Sep 2012 16:24:15 +1000
Subject: [torqueusers] How can we cancel a job array with qdel of torque4.1.0?[SOLVED]
In-Reply-To: <201209051115.AA14079@winxp-pc.sstc.co.jp>
References: <201209051115.AA14079@winxp-pc.sstc.co.jp>
Message-ID: <504D878F.6020800@unimelb.edu.au>

On 05/09/12 21:15, Go Yoshimura wrote:

> - It seems that if we add -t option, jobname should have []. - I'm
> not sure this is good approach but if we edit
> torque-4.1.1/src/cmds/qdel.c so that "[]" is added to job_id, we
> can cancel jobs in a job array with qdel.

Can I suggest you submit that as a bug to the Torque bugzilla please?

http://www.clusterresources.com/bugzilla/

Thanks!
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

From mej at lbl.gov Mon Sep 10 17:06:21 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 10 Sep 2012 16:06:21 -0700
Subject: [torqueusers] NHC 1.2 beta
Message-ID: <20120910230620.GA8827@lbl.gov>

If you're using Warewulf NHC, you might be interested to know that I
just released the first beta for version 1.2. There are several new
features, including:

 - nVidia HealthMon support
 - Detached mode
 - Variables in config files
 - More checks (including disk space checks)
 - More customization of command paths, etc.
 - Complete unit test suite

The tarball is available at
http://warewulf.lbl.gov/downloads/beta/warewulf-nhc-1.2beta1.tar.gz

The full release announcement with more details on the new features is
at:
https://groups.google.com/a/lbl.gov/forum/?fromgroups=#!topic/warewulf/mR0cWOm3pEg

If you're not using NHC yet, I hope the new features will prompt you
to take another look! :-)

Feedback is always welcome to either mailing list or to me directly,
whatever you prefer.

Enjoy!
Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From s.prabhakaran at grs-sim.de Mon Sep 10 12:40:17 2012 From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran) Date: Mon, 10 Sep 2012 20:40:17 +0200 Subject: [torqueusers] Maui/Torque with node properties Message-ID: <00DF894B-F5A2-4F73-BD12-1BFC0C6A562A@grs-sim.de> Dear all, I have 4 nodes with the following properties node1 fast node2 fast node3 slow node4 slow Traditionally, torque allows to request nodes with different properties by qsub -l nodes=1:fast+1:slow The above should allocate one fast node and one slow node and this works perfectly fine when pbs_sched is used. But when I use maui as my scheduler, I never get the nodes assigned and end up waiting infinitely. Is this feature supported in maui? Until now, I haven't read anywhere that this feature is not supported in maui. Or, am I just missing something here? Best, Suraj From go-yoshimura at sstc.co.jp Tue Sep 11 02:28:37 2012 From: go-yoshimura at sstc.co.jp (Go Yoshimura) Date: Tue, 11 Sep 2012 17:28:37 +0900 Subject: [torqueusers] How can we cancel a job array with qdel of torque4.1.0?[SOLVED] In-Reply-To: <504D878F.6020800@unimelb.edu.au> References: <504D878F.6020800@unimelb.edu.au> Message-ID: <201209110828.AA14124@winxp-pc.sstc.co.jp> Hi Chris! I have submitted this issue here http://www.clusterresources.com/bugzilla/show_bug.cgi?id=215 thank you go ---- Christopher Samuel wrote: >-----BEGIN PGP SIGNED MESSAGE----- >Hash: SHA1 > >On 05/09/12 21:15, Go Yoshimura wrote: > >> - It seems that if we add -t option, jobname should have []. - I'm >> not sure this is good approach but if we edit >> torque-4.1.1/src/cmds/qdel.c so that "[]" is added to job_id, we >> can cancel jobs in a job array with qdel. > >Can I suggest you submit that as a bug to the Torque bugzilla please? 
>
>http://www.clusterresources.com/bugzilla/
>
>Thanks!
>Chris
>- --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.4.11 (GNU/Linux)
>Comment: Using GnuPG with Mozilla - http://www.enigmail.net/
>
>iEYEARECAAYFAlBNh48ACgkQO2KABBYQAh8IrgCfYbFmGXi1rRsbTRJuHUIOmu97
>QQEAnRpdeMp81sUi5UcYDnyrMKAaCvV1
>=CA/c
>-----END PGP SIGNATURE-----

----
Go Yoshimura
Scalable Systems Co., Ltd.
Osaka Office HONMACHI-COLLABO Bldg. 4F, 4-4-2 Kita-kyuhoji-machi, Chuo-ku, Osaka 541-0057 Japan
Tel: 81-6-6224-4115
Tokyo Kojimachi Office BUREX Kojimachi 11F, 3-5-2 Kojimachi, Chiyoda-ku, Tokyo 102-0083 Japan
Tel: 81-3-5875-4718 Fax: 81-3-3237-7612

From akshar.bhosale at gmail.com Wed Sep 12 13:08:36 2012
From: akshar.bhosale at gmail.com (akshar bhosale)
Date: Thu, 13 Sep 2012 00:38:36 +0530
Subject: [torqueusers] reservation diagnosesis
Message-ID:

Hi,

We have torque and maui on a RHEL 5.2 cluster. When one of the users tries to create a reservation on 25 nodes, showres shows N/P as 25/1600 instead of 25/200. What could be the issue?

From Rob.Holmes at bmtwbm.com.au Wed Sep 12 16:57:01 2012
From: Rob.Holmes at bmtwbm.com.au (Rob Holmes)
Date: Wed, 12 Sep 2012 22:57:01 +0000
Subject: [torqueusers] spreading load over nodes
Message-ID: <74C3EAAEAFC2E746A6BDC0F215CABBD704AD2F@wbm-mail.bmt-wbm.local>

Hi all,

We have a small cluster (running torque & pbs_sched) that is only occasionally fully utilized. Most of the time there are not enough jobs to fill all the nodes. When a job is submitted it always goes to the first available node in the node list, resulting in "node01" getting significantly more work than "node14". I'm keen to spread the workload more evenly across the nodes. Is there a way to get torque to pick a free node at random, rather than the first free node on the list?

Cheers,
Rob

Rob Holmes
Environmental Scientist - Catchments and Receiving Environments
BMT WBM Pty Ltd
Level 8, 200 Creek Street Brisbane QLD 4000 Australia
P: +61 7 3831 6744
W: www.bmtwbm.com.au

From akshar.bhosale at gmail.com Wed Sep 12 22:24:06 2012
From: akshar.bhosale at gmail.com (akshar bhosale)
Date: Thu, 13 Sep 2012 09:54:06 +0530
Subject: [torqueusers] Fwd: reservation diagnosesis
In-Reply-To:
References:
Message-ID:

Hi,

From what I could find, 1600 is what torque sees, whereas maui (according to the maui logs) says it is 200.

---------- Forwarded message ----------
From: akshar bhosale
Date: Thu, 13 Sep 2012 00:38:36 +0530
Subject: reservation diagnosesis
To: torqueusers at supercluster.org, mauiusers

Hi,

We have torque and maui on a RHEL 5.2 cluster. When one of the users tries to create a reservation on 25 nodes, showres shows N/P as 25/1600 instead of 25/200. What could be the issue?

From craig.tierney at noaa.gov Thu Sep 13 11:22:57 2012
From: craig.tierney at noaa.gov (Craig Tierney)
Date: Thu, 13 Sep 2012 11:22:57 -0600
Subject: [torqueusers] Fwd: reservation diagnosesis
In-Reply-To:
References:
Message-ID:

Do you have 8-core nodes? Then it is probably doing the right thing. My reservations (in Moab, but still the same) are listed as tasks where a task takes the whole node.

If you send us the output of the commands you are running, that would also help figure out what is going on.

Craig

On Wed, Sep 12, 2012 at 10:24 PM, akshar bhosale wrote:
> Hi,
>
> From what I could find, 1600 is what torque sees, whereas maui
> (according to the maui logs) says it is 200.
>
> ---------- Forwarded message ----------
> From: akshar bhosale
> Date: Thu, 13 Sep 2012 00:38:36 +0530
> Subject: reservation diagnosesis
> To: torqueusers at supercluster.org, mauiusers
>
> Hi,
>
> We have torque and maui on a RHEL 5.2 cluster. When one of the users
> tries to create a reservation on 25 nodes, showres shows N/P
> as 25/1600 instead of 25/200. What could be the issue?
From Adrian.Sevcenco at cern.ch Sun Sep 16 11:57:34 2012
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Sun, 16 Sep 2012 20:57:34 +0300
Subject: [torqueusers] changing torque variables (PBS_O_*) BEFORE job is executed but AFTER submission
Message-ID: <5056130E.5030406@cern.ch>

Hi!

I want to change PBS_O_WORKDIR BEFORE a job is executed but AFTER it has been submitted by GRID middleware. Is it possible, and if so, how?

Thank you!
Adrian

From damianmontaldo at gmail.com Tue Sep 18 07:09:59 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Tue, 18 Sep 2012 10:09:59 -0300
Subject: [torqueusers] Scheduling GPUs before 2.5.4
Message-ID:

Can anyone confirm from which version of torque the scheduling of GPUs is supported?
The docs, section 3.7, say it is after 2.5.4:
http://www.clusterresources.com/torquedocs21/3.7schedulinggpus.shtml
Is that correct?

I'm using Debian and the latest version available is 2.4.16:
http://packages.debian.org/search?keywords=torque-server&searchon=names&suite=all&section=all

Thanks again.

On Tue, Sep 4, 2012 at 3:17 PM, Damian Montaldo wrote:
> On Tue, Sep 4, 2012 at 2:44 PM, Gus Correa wrote:
>> Have you tried:
>> qstat --version
>> or
>> qsub --version
>> ?
>
> Hi Gus, thanks for your reply.
>
> $ qsub --version
> version: 2.4.8
>
> It seems that the Debian package version is the same as the torque version.
> I thought that it was wrong because it was too old.
>
> Thanks again.
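
A minimal sketch of the version check discussed above, assuming `qsub --version` prints a line shaped like the `version: 2.4.8` output quoted in this thread. The helper names `parse_torque_version` and `version_at_least` are hypothetical and not part of TORQUE. Treating 2.5.4 itself as sufficient for GPU scheduling is also an assumption, since the thread only says "after 2.5.4".

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not TORQUE commands.

# Print the version number found in text shaped like "version: 2.4.8".
# Reads stdin and prints the first matching version field.
parse_torque_version() {
    sed -n 's/^version:[[:space:]]*//p' | head -n 1
}

# Succeed when dotted version $1 is greater than or equal to $2.
version_at_least() {
    # Sort numerically on the first three dot-separated fields;
    # the smaller version ends up on the first line.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
    [ "$lowest" = "$2" ]
}

# Possible use on a host where qsub is installed (left commented out;
# 2>&1 is used because the output stream is an assumption here):
# installed=$(qsub --version 2>&1 | parse_torque_version)
# if version_at_least "$installed" 2.5.4; then
#     echo "the GPU scheduling documentation may apply"
# fi
```

With the Debian package from this thread, the parsed value would be 2.4.8, which the comparison treats as older than 2.5.4.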
From damianmontaldo at gmail.com Wed Sep 19 16:39:31 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Wed, 19 Sep 2012 19:39:31 -0300
Subject: [torqueusers] Scheduling GPUs before 2.5.4
In-Reply-To:
References:
Message-ID:

On Tue, Sep 18, 2012 at 10:09 AM, Damian Montaldo wrote:
> Can anyone confirm from which version of torque the scheduling of
> GPUs is supported?
>
> I'm using Debian and the latest version available is 2.4.16:
> http://packages.debian.org/search?keywords=torque-server&searchon=names&suite=all&section=all

In case someone is following this thread, I found more info.
The lack of updates to the Debian package, and with it the missing support for scheduling GPUs, is already reported as a bug in the Debian bug tracking system:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=641484

Anyway, I'm trying to contact the maintainers of the package to provide some feedback.

Thanks again.

From d-ulrick at comcast.net Thu Sep 27 15:27:30 2012
From: d-ulrick at comcast.net (Dave Ulrick)
Date: Thu, 27 Sep 2012 16:27:30 -0500 (CDT)
Subject: [torqueusers] Cleaning up stray processes from defunct jobs
Message-ID:

On occasion I see a user run an MPI job via TORQUE that doesn't shut down cleanly and as a result leaves running processes behind to interfere with subsequent jobs that are assigned to its nodes. Any suggestions on how I might go about simplifying the task of finding and killing these processes?
Thanks, Dave -- Dave Ulrick d-ulrick at comcast.net From tbaer at utk.edu Thu Sep 27 15:40:04 2012 From: tbaer at utk.edu (Troy Baer) Date: Thu, 27 Sep 2012 17:40:04 -0400 Subject: [torqueusers] Cleaning up stray processes from defunct jobs In-Reply-To: References: Message-ID: <1348782004.15740.165.camel@browncoat.jics.utk.edu> On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote: > On occasion I see a user run an MPI job via TORQUE that doesn't shut down > cleanly and as a result leaves running processes behind to interfere with > subsequent jobs that are assigned to its nodes. Any suggestions on how I > might go about simplifying the task of finding and killing these > processes? I would recommend running something like reaver [1] in your epilogue.parallel on each node. [1] http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver --Troy -- Troy Baer, Senior HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From David.Singleton at anu.edu.au Thu Sep 27 16:04:15 2012 From: David.Singleton at anu.edu.au (David Singleton) Date: Fri, 28 Sep 2012 08:04:15 +1000 Subject: [torqueusers] Cleaning up stray processes from defunct jobs In-Reply-To: References: Message-ID: <5064CD5F.4090002@anu.edu.au> On 09/28/2012 07:27 AM, Dave Ulrick wrote: > On occasion I see a user run an MPI job via TORQUE that doesn't shut down > cleanly and as a result leaves running processes behind to interfere with > subsequent jobs that are assigned to its nodes. Any suggestions on how I > might go about simplifying the task of finding and killing these > processes? > Only support MPIs that use the tm API. You'll have to block ssh between nodes to enforce this. 
Cheers David From andrew.lahiff at stfc.ac.uk Fri Sep 28 00:08:24 2012 From: andrew.lahiff at stfc.ac.uk (andrew.lahiff at stfc.ac.uk) Date: Fri, 28 Sep 2012 06:08:24 +0000 Subject: [torqueusers] Torque 4.1.2 pbs_server crashes when running jobs with files to stage-in Message-ID: Hi, I've setup a small test batch system using Torque 4.1.2. If I just run very simple test jobs, e.g. qsub -q gridS sleep.sh where the script sleep.sh is shown below (*), everything is fine. However, whenever I try to submit a job including stage in, e.g. qsub -q gridS -W stagein="hosts at lcgvm17:/etc/hosts" sleep.sh then pbs_server crashes. The last few lines of the pbs_server log file look like this: 09/27/2012 22:41:37;0080;PBS_Server.29253;Req;dis_request_read;decoding command AlternateUserAuthentication from dteam087 09/27/2012 22:41:37;0100;PBS_Server.29253;Req;;Type AlternateUserAuthentication request received from dteam087 at lcgvm17, sock=10 09/27/2012 22:41:37;0001;PBS_Server.29254;Svr;PBS_Server;svr_setjobstate: setting job 79509.cloud041 state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15) 09/27/2012 22:41:37;0008;PBS_Server.29254;Job;reply_send_svr;Reply sent for request type RunJob on socket 9 09/27/2012 22:41:37;0001;PBS_Server.29255;Svr;PBS_Server;svr_setjobstate: setting job 79509.cloud041 state from RUNNING-STAGEGO to RUNNING-PRERUN (4-40) When running pbs_server with gdb and submitting the same type of job, I see this: allocated node cloud126/0 to job 79509.cloud041 (nsnfree=24) Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffee1fc700 (LWP 29255)] 0x000000000044c97b in send_job_to_mom (pjob_ptr=0x7fffee1fbc08, preq=0x0, parent_job=0x0) at req_runjob.c:1110 1110 if (preq->rq_reply.brp_un.brp_txt.brp_str != NULL) With Torque 4.1.0 everything was fine and I didn't experience this problem, but with Torque 4.1.1 pbs_server crashes as well. I'm using Linux 2.6.32-220.17.1.el6.x86_64. 
Has anyone else experienced this issue, or know what could be causing it? Many Thanks, Andrew. (*) #!/bin/sh sleep 10 hostname -- Scanned by iCritical. From rhys.hill at adelaide.edu.au Sun Sep 30 04:36:14 2012 From: rhys.hill at adelaide.edu.au (Rhys Hill) Date: Sun, 30 Sep 2012 10:36:14 +0000 Subject: [torqueusers] Error requeuing job Message-ID: <6F5EFC86-5DF8-4036-89D8-681622EF4CBE@adelaide.edu.au> Hi everyone, I have a particular job that I run regularly as part of a development project. On the occasions where torque gets stuck, this particular job is always lost when the daemon is restarted, even though all the other jobs seem to return OK. I always get a message along these lines: Unable to requeue job, queue is not defined; job XXX queue batch where the qstat -q says: server: XXX Queue Memory CPU Time Walltime Node Run Que Lm State ---------------- ------ -------- -------- ---- --- --- -- ----- large -- -- 24:00:00 -- 0 0 -- E R long_running -- -- -- -- 0 0 -- E R image_search -- -- -- -- 0 0 -- E R batch -- -- 48:00:00 -- 122 7 -- E R ----- ----- 122 7 so obviously the queue is actually there. I submit the jobs using a script like this: --- #!/bin/sh DS_JOB=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G ./data_statistics.sh` JOBS=`ls */job.sh` DEPS=; for j in ${JOBS}; do JOB_ID=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -W afterok:${DS_JOB} -l vmem=18G $j` if [ "${DEPS}x" = "x" ]; then DEPS="afterok:${JOB_ID}" else DEPS="${DEPS},afterok:${JOB_ID}" fi done qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G -W depend=${DEPS} ./run_report.sh --- ie. 
the data_statistics.sh job runs first, followed by several instances of job.sh, then run_report.sh The server log looks like this in total: 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;enqueuing into batch, state 4 hop 1 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6614.XXX queue batch 09/30/2012 19:02:55;0001;PBS_Server;Req;;Server could not connect to MOM 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job_abt, Unable to abort Job 6614.XXX which was in substate 42 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;dequeuing from batch, state RUNNING 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;enqueuing into batch, state 1 hop 1 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6615.XXX queue batch 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error 09/30/2012 
19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;dequeuing from batch, state EXITING 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;enqueuing into batch, state 1 hop 1 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6616.XXX queue batch 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;dequeuing from batch, state EXITING 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;enqueuing into batch, state 2 hop 1 09/30/2012 19:02:55;0080;PBS_Server;Job;6614.XXX;Unknown Job Id Error 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6617.XXX queue batch 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;dequeuing from batch, state EXITING we're using moab for scheduling, if that 
makes any difference. Any ideas?

Cheers,

--------------------------------------------------------------------------------
Rhys Hill, Senior Research Associate
Australian Centre for Visual Technologies, University of Adelaide
Phone: +61 8 8313 6197
Fax: +61 8 8313 4366
Mail: School of Computer Science, University of Adelaide, Adelaide, Australia 5005
http://www.cs.adelaide.edu.au/~rhys/
--------------------------------------------------------------------------------

From samuel at unimelb.edu.au Sun Sep 30 23:12:41 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 01 Oct 2012 15:12:41 +1000
Subject: [torqueusers] Cleaning up stray processes from defunct jobs
In-Reply-To: <5064CD5F.4090002@anu.edu.au>
References: <5064CD5F.4090002@anu.edu.au>
Message-ID: <50692649.8040807@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 28/09/12 08:04, David Singleton wrote:

> Only support MPIs that use the tm API. You'll have to block ssh
> between nodes to enforce this.

You can also support many non-TM enabled MPI stacks with the OSC mpiexec replacement here:

https://www.osc.edu/~djohnson/mpiexec/

cheers,
Chris
- --
 Christopher Samuel Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/ http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://www.enigmail.net/

iEYEARECAAYFAlBpJkkACgkQO2KABBYQAh/y9ACeM3AhOA9CIzvmh3VL8zHXVTeg
xYUAnile4q60vdab7MPdvvkfKr1Nt2w1
=ihHk
-----END PGP SIGNATURE-----

From gezhengzheng612 at gmail.com Wed Sep 5 03:37:13 2012
From: gezhengzheng612 at gmail.com (zhengzheng ge)
Date: Wed, 05 Sep 2012 09:37:13 -0000
Subject: [torqueusers] GPU Configuration
Message-ID:

Hi,

I am using Torque-4.1.0 now. The GPU is an NVIDIA GPU, and I want to use OpenCL. OpenCL is already configured on the node, but how do I configure torque to use the GPU?

From bloring at lbl.gov Wed Sep 5 12:45:48 2012
From: bloring at lbl.gov (Burlen Loring)
Date: Wed, 05 Sep 2012 18:45:48 -0000
Subject: [torqueusers] query job memory limits
Message-ID: <5047A082.6020008@lbl.gov>

Hi,

Is it possible to query a job's memory limit from within the running program? I'm assuming that there may be a node/host-wide total imposed by the batch system. For example, on an SGI UV, which looks like a single host to the program, I'd like to know how much RAM all the processes in my program can use in order to size internal caches appropriately. It seems that this limit is enforced by the batch system; however, there doesn't seem to be consistency across sites. Any advice would be greatly appreciated.

Burlen

From doug.holt at gmail.com Mon Sep 10 09:10:54 2012
From: doug.holt at gmail.com (doug holt)
Date: Mon, 10 Sep 2012 15:10:54 -0000
Subject: [torqueusers] Socket issues in Torque 4.1.x
In-Reply-To:
References:
Message-ID:

Since switching from branch 3.0.x to 4.1.x we've been encountering an issue where we appear to be running out of available sockets while queuing/scheduling jobs. We routinely queue tens of thousands of jobs at a time (up to around 30-40k total), and after several hundred or a thousand I start seeing these errors in the logs and random jobs get dropped (not queued). I've tried limiting the rate at which I add jobs, adjusting the number of open files (ulimit -n 32788), adjusting TCP_WAIT timeout from 60 to 5 seconds (/proc/sys/net/ipv4/tcp_fin_timeout), etc. This is essentially a brand-new system with a default installation of Torque 4.1.1.
09/08/2012 12:16:54;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request, closed connections to fd 29 - num_connections=74 (select bad socket) 09/08/2012 12:16:54;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request, closed connections to fd 12 - num_connections=68 (select bad socket) 09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request, closed connections to fd 12 - num_connections=59 (select bad socket) 09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request, closed connections to fd 29 - num_connections=54 (select bad socket) 09/08/2012 12:16:56;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request, closed connections to fd 32 - num_connections=52 (select bad socket) 09/08/2012 12:16:58;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request, closed connections to fd 78 - num_connections=27 (select bad socket) 09/08/2012 12:16:59;0001;PBS_Server.42666;Svr;PBS_Server;LOG_ERROR::wait_request, closed connections to fd 11 - num_connections=17 (select bad socket) 09/08/2012 00:13:32;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad file descriptor (9) in wait_request, Unable to select sockets to read requests 09/08/2012 00:14:43;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad file descriptor (9) in wait_request, Unable to select sockets to read requests 09/08/2012 00:16:09;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad file descriptor (9) in wait_request, Unable to select sockets to read requests 09/08/2012 00:18:24;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad file descriptor (9) in wait_request, Unable to select sockets to read requests 09/08/2012 00:19:23;0001;PBS_Server.40767;Svr;PBS_Server;LOG_ERROR::Bad file descriptor (9) in wait_request, Unable to select sockets to read requests Even when there are only a few hundred socket connections I'll still get messages like this when using various commands: -bash-4.1# qstat Error code - 15096 : message [cannot connect to 
port -1 in socket_connect_addr - errno:9 Bad file descriptor] parse_daemon_response error Error communicating with node(xxx.xxx.xxx.xxx) Communication failure. qstat: cannot connect to server node (errno=15096) Error getting connection to socket Any suggestions? Thanks, Doug Holt
From jonathan.barber at gmail.com Tue Sep 11 15:48:39 2012 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Tue, 11 Sep 2012 21:48:39 -0000 Subject: [torqueusers] Torque 4.1.2: MUNGE vs trqauthd Message-ID: I'm looking at torque 4.1.2 and trying to work out what the difference is between MUNGE and trqauthd. According to the fine documentation here: http://www.adaptivecomputing.com/resources/docs/torque/4-0/Content/topics/1-installConfig/configuringTrqauthdForClientCom.htm and here: http://www.adaptivecomputing.com/resources/docs/torque/4-0/Content/topics/1-installConfig/serverConfig.htm#usingMUNGEAuth they both seem to have the purpose of confirming the user's identity to the Torque server, and are exclusive options (according to the function pbs_original_connect in pbsD_connect.c). Am I right in thinking this? If so, which one should I be using and why? Regards -- Jonathan Barber
From bunk at physik.hu-berlin.de Thu Sep 13 07:48:06 2012 From: bunk at physik.hu-berlin.de (Burkhard Bunk) Date: Thu, 13 Sep 2012 13:48:06 -0000 Subject: [torqueusers] array job crashes server Message-ID: Hi, after setting set server display_job_server_suffix = False (such that JobID = job_number), any attempt to submit an array job with qsub -t