From wannes.van.causbroeck at imdc.be Mon Oct 3 00:29:47 2011
From: wannes.van.causbroeck at imdc.be (Wannes Van Causbroeck)
Date: Mon, 3 Oct 2011 08:29:47 +0200
Subject: [torqueusers] numa problems
References: <6fe1606a-e98b-4d30-9093-8e2d1d0a3bad@mail>
Message-ID: <9EBFECC459B64D448BC1AF3A7CFAB40588CE85@imdc-mail.imdc.local>

Thanks for the input guys!
Running lstopo i get the following output:

Machine (126GB)
  Socket L#0 (32GB)
    NUMANode L#0 (P#0 16GB) + L3 L#0 (5118KB)
      L2 L#0 (512KB) + L1 L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1 L#1 (64KB) + Core L#1 + PU L#1 (P#4)
      L2 L#2 (512KB) + L1 L#2 (64KB) + Core L#2 + PU L#2 (P#8)
      L2 L#3 (512KB) + L1 L#3 (64KB) + Core L#3 + PU L#3 (P#12)
      L2 L#4 (512KB) + L1 L#4 (64KB) + Core L#4 + PU L#4 (P#16)
      L2 L#5 (512KB) + L1 L#5 (64KB) + Core L#5 + PU L#5 (P#20)
    NUMANode L#1 (P#1 16GB) + L3 L#1 (5118KB)
      L2 L#6 (512KB) + L1 L#6 (64KB) + Core L#6 + PU L#6 (P#24)
      L2 L#7 (512KB) + L1 L#7 (64KB) + Core L#7 + PU L#7 (P#28)
      L2 L#8 (512KB) + L1 L#8 (64KB) + Core L#8 + PU L#8 (P#32)
      L2 L#9 (512KB) + L1 L#9 (64KB) + Core L#9 + PU L#9 (P#36)
      L2 L#10 (512KB) + L1 L#10 (64KB) + Core L#10 + PU L#10 (P#40)
      L2 L#11 (512KB) + L1 L#11 (64KB) + Core L#11 + PU L#11 (P#44)
  Socket L#1 (32GB)
    NUMANode L#2 (P#2 16GB) + L3 L#2 (5118KB)
      L2 L#12 (512KB) + L1 L#12 (64KB) + Core L#12 + PU L#12 (P#1)
      L2 L#13 (512KB) + L1 L#13 (64KB) + Core L#13 + PU L#13 (P#5)
      L2 L#14 (512KB) + L1 L#14 (64KB) + Core L#14 + PU L#14 (P#9)
      L2 L#15 (512KB) + L1 L#15 (64KB) + Core L#15 + PU L#15 (P#13)
      L2 L#16 (512KB) + L1 L#16 (64KB) + Core L#16 + PU L#16 (P#17)
      L2 L#17 (512KB) + L1 L#17 (64KB) + Core L#17 + PU L#17 (P#21)
    NUMANode L#3 (P#3 16GB) + L3 L#3 (5118KB)
      L2 L#18 (512KB) + L1 L#18 (64KB) + Core L#18 + PU L#18 (P#25)
      L2 L#19 (512KB) + L1 L#19 (64KB) + Core L#19 + PU L#19 (P#29)
      L2 L#20 (512KB) + L1 L#20 (64KB) + Core L#20 + PU L#20 (P#33)
      L2 L#21 (512KB) + L1 L#21 (64KB) + Core L#21 + PU L#21 (P#37)
      L2 L#22 (512KB) + L1 L#22 (64KB) + Core L#22 + PU L#22 (P#41)
      L2 L#23 (512KB) + L1 L#23 (64KB) + Core L#23 + PU L#23 (P#45)
  Socket L#2 (32GB)
    NUMANode L#4 (P#4 16GB) + L3 L#4 (5118KB)
      L2 L#24 (512KB) + L1 L#24 (64KB) + Core L#24 + PU L#24 (P#2)
      L2 L#25 (512KB) + L1 L#25 (64KB) + Core L#25 + PU L#25 (P#6)
      L2 L#26 (512KB) + L1 L#26 (64KB) + Core L#26 + PU L#26 (P#10)
      L2 L#27 (512KB) + L1 L#27 (64KB) + Core L#27 + PU L#27 (P#14)
      L2 L#28 (512KB) + L1 L#28 (64KB) + Core L#28 + PU L#28 (P#18)
      L2 L#29 (512KB) + L1 L#29 (64KB) + Core L#29 + PU L#29 (P#22)
    NUMANode L#5 (P#5 16GB) + L3 L#5 (5118KB)
      L2 L#30 (512KB) + L1 L#30 (64KB) + Core L#30 + PU L#30 (P#26)
      L2 L#31 (512KB) + L1 L#31 (64KB) + Core L#31 + PU L#31 (P#30)
      L2 L#32 (512KB) + L1 L#32 (64KB) + Core L#32 + PU L#32 (P#34)
      L2 L#33 (512KB) + L1 L#33 (64KB) + Core L#33 + PU L#33 (P#38)
      L2 L#34 (512KB) + L1 L#34 (64KB) + Core L#34 + PU L#34 (P#42)
      L2 L#35 (512KB) + L1 L#35 (64KB) + Core L#35 + PU L#35 (P#46)
  Socket L#3 (32GB)
    NUMANode L#6 (P#6 16GB) + L3 L#6 (5118KB)
      L2 L#36 (512KB) + L1 L#36 (64KB) + Core L#36 + PU L#36 (P#3)
      L2 L#37 (512KB) + L1 L#37 (64KB) + Core L#37 + PU L#37 (P#7)
      L2 L#38 (512KB) + L1 L#38 (64KB) + Core L#38 + PU L#38 (P#11)
      L2 L#39 (512KB) + L1 L#39 (64KB) + Core L#39 + PU L#39 (P#15)
      L2 L#40 (512KB) + L1 L#40 (64KB) + Core L#40 + PU L#40 (P#19)
      L2 L#41 (512KB) + L1 L#41 (64KB) + Core L#41 + PU L#41 (P#23)
    NUMANode L#7 (P#7 16GB) + L3 L#7 (5118KB)
      L2 L#42 (512KB) + L1 L#42 (64KB) + Core L#42 + PU L#42 (P#27)
      L2 L#43 (512KB) + L1 L#43 (64KB) + Core L#43 + PU L#43 (P#31)
      L2 L#44 (512KB) + L1 L#44 (64KB) + Core L#44 + PU L#44 (P#35)
      L2 L#45 (512KB) + L1 L#45 (64KB) + Core L#45 + PU L#45 (P#39)
      L2 L#46 (512KB) + L1 L#46 (64KB) + Core L#46 + PU L#46 (P#43)
      L2 L#47 (512KB) + L1 L#47 (64KB) + Core L#47 + PU L#47 (P#47)

I guess the non-sequential core numbering is correct?
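
A hedged aside on the layout file: the lstopo output above lists PUs P#25, 29, 33, 37, 41 and 45 under NUMANode P#3, whereas the mom.layout quoted later in this thread has 31 in that position, so the file may be worth re-checking against the kernel's own view. What follows is only a sketch of one way to do that. It assumes a Linux sysfs tree at /sys/devices/system/node, and it assumes mom.layout accepts whatever list syntax the kernel prints in cpulist (which can include ranges such as 0-5). The function name gen_mom_layout and its optional directory argument are made up for illustration and are not part of TORQUE.

```shell
#!/bin/sh
# Sketch: print one "cpus=... mem=N" line per NUMA node found under a
# sysfs-style directory, ordered by node number.
# The optional argument (default: /sys/devices/system/node) exists only so
# the function can be pointed at a test directory; it is not a TORQUE option.
gen_mom_layout() {
    root="${1:-/sys/devices/system/node}"
    for dir in "$root"/node[0-9]*; do
        # Skip anything without a readable cpulist (including an unmatched glob).
        [ -r "$dir/cpulist" ] || continue
        # ${dir##*/node} strips everything up to the last "/node", leaving N.
        printf 'cpus=%s mem=%s\n' "$(cat "$dir/cpulist")" "${dir##*/node}"
    done | sort -t= -k3,3n   # third "="-separated field is the node number
}
```

Running gen_mom_layout with no argument and comparing its output by hand with the existing mom.layout should show whether the two agree; it does not say anything about how TORQUE 3.0.x copes with the interleaved numbering.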

-----Original Message-----
From: torqueusers-bounces at supercluster.org on behalf of David Beer
Sent: Fri 9/30/2011 17:14
To: Torque Users Mailing List
Subject: Re: [torqueusers] numa problems

----- Original Message -----
> Hello everyone!
> I sent this message before, but i don't know if it arrived correctly,
> so i'll try again. (sorry if this is a dupe)
>
> we're just starting out with torque, but we've run into a problem. We
> have a 48-core AMD system (4 sockets with 12 cores each). The linux
> system sees this as 8 nodes with 6 cores each.
> I've tried compiling torque 3.02 with --enable-cpuset and
> --enable-numa-support. (i also tried without cpuset, but the result was
> the same, i even got an error telling me i had to mount /dev/cpuset,
> even without this switch???).

Numa support uses cpusets for its implementation, so yes, you'll get the same result whether or not you use the --enable-cpuset switch. You will definitely need to mount cpusets in order to get things working.

> Anyway, our mom.layout looks like this:
>
> cpus=0,4,8,12,16,20 mem=0
> cpus=24,28,32,36,40,44 mem=1
> cpus=1,5,9,13,17,21 mem=2
> cpus=25,29,33,37,31,45 mem=3
> cpus=2,6,10,14,18,22 mem=4
> cpus=26,30,34,38,42,46 mem=5
> cpus=3,7,11,15,19,23 mem=6
> cpus=27,31,35,39,43,47 mem=7
>
> it's a bit strange, but this is how it's reported by linux.
> When i start a job with these parameters:
>
> #PBS -N JobMPI
> #PBS -l nodes=1:ppn=4
> #PBS -m abe
>
> It starts 4 processes in a really weird way. Sometimes he uses core
> 0,1,2,3, sometimes 2 processes get run on one core, then it jumps to
> core 24, etc.
> the system takes a big performance hit when the processes aren't run on
> the cores sharing the same memory, so we want to lock the tasks on the
> same node.
>
> What am i doing wrong?

I second Chris's suggestion - please send in the output of lstopo and we'll see what to do from there. I do wonder about your ordering - I'm not sure that TORQUE 3.0.* is well-equipped to handle a system with that kind of layout, but send in your lstopo output and we'll help you as much as we can.

-- 
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1656 S. East Bay Blvd. Suite #300
Provo, UT 84606

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From stijn.deweirdt at ugent.be Mon Oct 3 00:42:36 2011
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Mon, 03 Oct 2011 08:42:36 +0200
Subject: [torqueusers] numa problems
In-Reply-To: <9EBFECC459B64D448BC1AF3A7CFAB40588CE85@imdc-mail.imdc.local>
References: <6fe1606a-e98b-4d30-9093-8e2d1d0a3bad@mail> <9EBFECC459B64D448BC1AF3A7CFAB40588CE85@imdc-mail.imdc.local>
Message-ID: <4E89595C.2070101@ugent.be>

and check for a bios update for your motherboard (or complain to your vendor) to get the numbering "fixed".

stijn

> Thanks for the input guys!

From Gareth.Williams at csiro.au Mon Oct 3 05:08:28 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Mon, 3 Oct 2011 22:08:28 +1100
Subject: [torqueusers] limit the number of jobs a user can submit
In-Reply-To: <20110930184353.GF21971@stikine.sfu.ca>
References: <20110930184353.GF21971@stikine.sfu.ca>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE7C@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Martin Siegert [mailto:siegert at sfu.ca]
> Sent: Saturday, 1 October 2011 4:44 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] limit the number of jobs a user can submit
>
> Hi,
>
> I know this has been discussed before, but I believe an important
> aspect has been overlooked:
>
> Moab has a limit on the number of jobs it can handle: the MAXJOB
> parameter:
>
> "Specifies the maximum number of simultaneous jobs which can be evaluated
> by the scheduler. If additional jobs are submitted to the resource manager,
> Moab will ignore these jobs until previously submitted jobs complete."
>
> This allows for a trivial denial-of-service attack:
> Simply submit a job array with at least MAXJOB+1 elements.
>
> After that moab will disregard all further jobs for scheduling
> even if they have a much higher priority than the array job elements.
>
> I have not yet found a way of preventing this DoS attack.
> The most logical solution to me would be to expand the "max_user_queuable"
> specification to allow for a server wide setting, not just a per
> queue setting, i.e.,
>
> set server max_user_queuable = 1000
>
> Is that a feasible solution?
> (and, yes, I'd like this limit to be in torque and not in moab because
> the user will get an immediate response from qsub).
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Simon Fraser University

Hi Martin,

We were bitten by this last week (for the first time ever that we know of) and increased MAXJOB. I think using a combination of routing queues and execution queues with max_user_queuable should work. That way a user can only deny service to themselves. This solution is advocated here: http://www.clusterresources.com/pipermail/torqueusers/2007-August/005922.html

A recent query has more detail but unfortunately was unanswered: http://www.clusterresources.com/pipermail/torqueusers/2011-September/013339.html

I'd like to try this setup but don't want the dependency problems.

Perhaps it can work for you if you increase MAXJOB (say 40k) and set max_user_queuable modestly (say 1000). With those numbers you wouldn't get problems until 40 users submitted 1000 jobs each, assuming they don't use multiple queues.

Cheers,

Gareth

Note: if there were to be a server-wide max_user_queuable, that would imply that jobs could not go into routing queues as well as execution queues.

From vlad at cosy.sbg.ac.at Mon Oct 3 07:27:40 2011
From: vlad at cosy.sbg.ac.at (vlad at cosy.sbg.ac.at)
Date: Mon, 3 Oct 2011 15:27:40 +0200
Subject: [torqueusers] Jobs in queue do not get started
Message-ID:

Hi!

We have set up a queue and are even able to queue MPI jobs, which get processed as long as all the resources are free at first. When we then submit more jobs than the system can process at once, the jobs get queued (which is good and the purpose of all that), but they never leave the queue and are never processed. In the meantime the nodes become free and sit idle, twiddling their thumbs.

As I already wrote in previous mails, we have a Torque 3.0.3 snapshot and Maui version 3.3.1 running.

So when "qstat" shows me that job #XYZ is "Q", it stays that way until the end of time. tracejob shows the activity of the jobs, but it reveals no error messages. Eventually I deleted them with qdel and they were removed.

Our pbs_server configuration:

qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue gpushort
#
create queue gpushort
set queue gpushort queue_type = Execution
set queue gpushort resources_min.nodes = 1
set queue gpushort resources_default.neednodes = gpunode
set queue gpushort resources_default.nodes = 1
set queue gpushort resources_default.walltime = 24:00:00
set queue gpushort enabled = True
set queue gpushort started = True
#
... (more queues to follow, all unused at the moment ..)

(pbs_server configuration:)
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = gpu
set server managers = forsthof at gpu
set server managers += peter at gpu
set server managers += root at gpu
set server managers += vlad at gpu
set server operators = forsthof at gpu
set server operators += peter at gpu
set server operators += root at gpu
set server operators += vlad at gpu
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 7
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 247

Our submit script:

#!/bin/sh
#
#This is an example script example.sh
#
#These commands set up the Grid Environment for your job:
#PBS -N CPI
#PBS -l nodes=7
#PBS -V
#PBS -q gpushort
#
#PBS -m abe

np=$(cat $PBS_NODEFILE | wc -l)
#print the time and date
date
echo "$PBS_NODEFILE"
cat $PBS_NODEFILE
date >> /tmp/start.txt
echo "/usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE /home/user/queuing/cpi" >> /tmp/start.txt
/usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -n 28 --hostfile $PBS_NODEFILE /home/user/queuing/integrate_queued 100000000 100000

Funny: even though only 7 nodes are requested, the 28 processes start fine until the resources are exhausted. After that, jobs get queued forever and are never started.

Any clues?
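
One thing that stands out in the script above is that np is computed from $PBS_NODEFILE but never used, while mpirun is told to start 28 processes on a request for nodes=7. Below is a minimal sketch of tying the process count to what the batch system actually granted. count_slots is a hypothetical helper name, and the sketch assumes the node file holds one line per allocated slot. Whether this mismatch has anything to do with the jobs stuck in state Q is not established here.

```shell
#!/bin/sh
# count_slots FILE: print the number of non-empty lines in a PBS-style
# node file (assumed: one line per allocated slot).
count_slots() {
    grep -c . "$1"
}

# Inside a job script one might then write (paths taken from the script above):
#   np=$(count_slots "$PBS_NODEFILE")
#   /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -n "$np" --hostfile "$PBS_NODEFILE" \
#       /home/user/queuing/integrate_queued 100000000 100000
```

If 28 processes are really wanted, a request in the nodes=N:ppn=M form used in the numa thread above (for example nodes=7:ppn=4) would make the request and the launch agree, assuming each node is configured with at least 4 processors.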

Greetings,
Vlad Popa

From siegert at sfu.ca Mon Oct 3 11:07:11 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 3 Oct 2011 10:07:11 -0700
Subject: [torqueusers] limit the number of jobs a user can submit
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE7C@exvic-mbx04.nexus.csiro.au>
References: <20110930184353.GF21971@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102B6D6AE7C@exvic-mbx04.nexus.csiro.au>
Message-ID: <20111003170711.GD4279@stikine.sfu.ca>

Hi Gareth,

On Mon, Oct 03, 2011 at 10:08:28PM +1100, Gareth.Williams at csiro.au wrote:
> Hi Martin,
>
> We were bitten by this last week (for the first time ever that we know
> of) and increased MAXJOB. I think using a combination of routing queues
> and execution queues with max_user_queuable should work. That way a user
> can only deny service to themselves. This solution is advocated here: http://www.clusterresources.com/pipermail/torqueusers/2007-August/005922.html
> A recent query has more detail but unfortunately was unanswered: http://www.clusterresources.com/pipermail/torqueusers/2011-September/013339.html
> I'd like to try this setup but don't want the dependency problems.

We've been bitten by this now at least three times and increased MAXJOB already to 8192. I am very reluctant to go any higher: moab is already using more than 8GB of address space (I suspect mostly because of our complicated fstree structure) and increasing MAXJOB can only make this worse.

I'd like to try the second setup - is there some documentation on what route_held_jobs=False actually does? And, yes, the dependency problem is ugly.

> Perhaps it can work for you if you increase MAXJOB (say 40k) and set
> max_user_queuable modestly (say 1000). With those numbers you wouldn't
> get problems until 40 users submitted 1000 jobs each assuming they don't
> use multiple queues.

This could work - in all cases we have seen so far it was a single user who took down the system by submitting huge array jobs. However, my limits would have to be much smaller. The one time I tried this I was not sure whether moab was treating jobs in routing queues correctly - I need to test this again.

> Cheers,
>
> Gareth
>
> Note. If there were to be a server-wide max_user_queuable, that would
> imply that jobs could not go into routing queues as well as execution
> queues.

Thanks for your suggestions.
But my intent was indeed to not have jobs sitting in routing queues - in our case jobs would actually pile up in two routing queues:

default -> exec1, exec2, route2
route2 -> exec3, exec4, exec5

Thus, jobs destined for exec3 would pile up in route2 AND default, which can only make the dependency problem worse. Thus, yes, I'd like to see a server-wide max_user_queuable for exactly the reason that I do not want jobs to pile up in routing queues.

Cheers,
Martin

From nucci at arl.psu.edu Mon Oct 3 12:55:58 2011
From: nucci at arl.psu.edu (Jeffrey J. Nucciarone)
Date: Mon, 3 Oct 2011 14:55:58 -0400
Subject: [torqueusers] disable -I option to qsub
Message-ID:

Is there an easy way to disable interactive shells (qsub -I) either via pam rules or changes to torque config? I need to find a way to keep users from easily obtaining a command prompt.

Thanks,

--Jeff

From janusz.mordarski at uj.edu.pl Mon Oct 3 14:28:01 2011
From: janusz.mordarski at uj.edu.pl (Janusz Mordarski)
Date: Mon, 03 Oct 2011 22:28:01 +0200
Subject: [torqueusers] disable -I option to qsub
In-Reply-To:
References:
Message-ID: <4E8A1AD1.9090505@uj.edu.pl>

And I am interested in enabling it! So I hope there will be an answer to both your question and mine. In my torque 2.4.8 installation, interactive job submission is somehow disabled.

On 10/03/2011 08:55 PM, Jeffrey J. Nucciarone wrote:
> Is there an easy way to disable interactive shells (qsub -I) either
> via pam rules or changes to torque config? I need to find a way to
> keep users from easily obtaining a command prompt.
>
> Thanks,
>
> --Jeff

-- 
Dept of Computational Biophysics and Bioinformatics,
Faculty of Biochemistry, Biophysics and Biotechnology,
Jagiellonian University,
ul. Gronostajowa 7,
30-387 Krakow, Poland.
Tel: (+48-12)-664-6380

From dbeer at adaptivecomputing.com Mon Oct 3 14:30:22 2011
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 03 Oct 2011 14:30:22 -0600 (MDT)
Subject: [torqueusers] disable -I option to qsub
In-Reply-To:
Message-ID:

----- Original Message -----
> Is there an easy way to disable interactive shells (qsub -I) either
> via pam rules or changes to torque config? I need to find a way to
> keep users from easily obtaining a command prompt.
>
> Thanks,
>
> --Jeff

The easiest way to do this would be to use a qsub filter: http://www.adaptivecomputing.com/resources/docs/torque/a.jqsubwrapper.php

You can take the sample, add code to check for a -I, and then reject the job if present.

-- 
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1656 S. East Bay Blvd. Suite #300
Provo, UT 84606

From sm4082 at nyu.edu Mon Oct 3 14:32:59 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Mon, 3 Oct 2011 16:32:59 -0400
Subject: [torqueusers] disable -I option to qsub
In-Reply-To:
References:
Message-ID:

Hi Jeff,

You can do this through a qsub wrapper. The wrapper can check qsub's arguments and, if it finds -I, reject the job administratively (exit -1). This might not be the best solution, but it works. If you want, I can send you a simple script that does it.

Sreedhar.

On Oct 3, 2011, at 2:55 PM, Jeffrey J. Nucciarone wrote:
> Is there an easy way to disable interactive shells (qsub -I) either via pam rules or changes to torque config? I need to find a way to keep users from easily obtaining a command prompt.
>
> Thanks,
>
> --Jeff

From glen.beane at gmail.com Mon Oct 3 14:39:51 2011
From: glen.beane at gmail.com (Glen Beane)
Date: Mon, 3 Oct 2011 16:39:51 -0400
Subject: [torqueusers] disable -I option to qsub
In-Reply-To:
References:
Message-ID: <1E2727D0-32E4-4636-9BF5-8DEA3346EE6E@gmail.com>

In qmgr:

set queue queue_name disallowed_types += interactive

On Oct 3, 2011, at 2:55 PM, "Jeffrey J. Nucciarone" wrote:
> Is there an easy way to disable interactive shells (qsub -I) either via pam rules or changes to torque config? I need to find a way to keep users from easily obtaining a command prompt.
>
> Thanks,
>
> --Jeff

From glen.beane at gmail.com Mon Oct 3 14:43:14 2011
From: glen.beane at gmail.com (Glen Beane)
Date: Mon, 3 Oct 2011 16:43:14 -0400
Subject: [torqueusers] disable -I option to qsub
In-Reply-To:
References:
Message-ID:

On Mon, Oct 3, 2011 at 4:30 PM, David Beer wrote:
> ----- Original Message -----
>> Is there an easy way to disable interactive shells (qsub -I) either
>> via pam rules or changes to torque config? I need to find a way to
>> keep users from easily obtaining a command prompt.
>>
>> Thanks,
>>
>> --Jeff
>
> The easiest way to do this would be to use a qsub filter: http://www.adaptivecomputing.com/resources/docs/torque/a.jqsubwrapper.php
>
> You can take the sample, add code to check for a -I, and then reject the job if present.

I would say in this case the easiest way would be to use the disallowed_types queue attribute, which was designed specifically for this use case (as well as for sites that wanted to disallow non-rerunnable jobs from a queue where they could be preempted).

From the pbs_queue_attributes man page:

disallowed_types
    List of job "types" (interactive,batch,rerunable,nonrerunable,fault_tolerant,fault_intolerant) that are not allowed in this queue.
    default value: all types allowed.

From nucci at arl.psu.edu Mon Oct 3 15:35:49 2011
From: nucci at arl.psu.edu (Jeffrey J. Nucciarone)
Date: Mon, 3 Oct 2011 17:35:49 -0400
Subject: [torqueusers] disable -I option to qsub
In-Reply-To: <1E2727D0-32E4-4636-9BF5-8DEA3346EE6E@gmail.com>
References: <1E2727D0-32E4-4636-9BF5-8DEA3346EE6E@gmail.com>
Message-ID:

This one did the trick, and I got the pam rules set to prevent user logins. Thanks so much for the help!!

--Jeff

Jeff Nucciarone
Linux Systems Administrator
Computer Security Dept, C&N Division
Applied Research Laboratory
PO Box 30
State College, PA 16804-0030
------------------------------------
(814)867-3406

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Glen Beane
Sent: Monday, October 03, 2011 4:40 PM
To: Torque Users Mailing List
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] disable -I option to qsub

In qmgr

set queue queue_name disallowed_types += interactive

On Oct 3, 2011, at 2:55 PM, "Jeffrey J. Nucciarone" wrote:

Is there an easy way to disable interactive shells (qsub -I) either via pam rules or changes to torque config? I need to find a way to keep users from easily obtaining a command prompt.

Thanks,

--Jeff

From northrup at scinet.utoronto.ca Mon Oct 3 13:09:15 2011
From: northrup at scinet.utoronto.ca (Scott Northrup)
Date: Mon, 03 Oct 2011 15:09:15 -0400
Subject: [torqueusers] disable -I option to qsub
In-Reply-To:
References:
Message-ID: <4E8A085B.9080504@scinet.utoronto.ca>

Jeff,

You can set a qmgr parameter per queue:

qmgr -c "set queue batch disallowed_types = interactive"

Scott

Jeffrey J. Nucciarone wrote:
> Is there an easy way to disable interactive shells (qsub -I) either via
> pam rules or changes to torque config? I need to find a way to keep
> users from easily obtaining a command prompt.
> > > > Thanks, > > > > --Jeff > > > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Scott Northrup 416-978-2753 HPC Analyst northrup at scinet.utoronto.ca SciNet HPC Consortium www.scinet.utoronto.ca Compute/Calcul Canada www.computecanada.org From samuel at unimelb.edu.au Mon Oct 3 22:12:05 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 04 Oct 2011 15:12:05 +1100 Subject: [torqueusers] numa problems In-Reply-To: <9EBFECC459B64D448BC1AF3A7CFAB40588CE85@imdc-mail.imdc.local> References: <6fe1606a-e98b-4d30-9093-8e2d1d0a3bad@mail> <9EBFECC459B64D448BC1AF3A7CFAB40588CE85@imdc-mail.imdc.local> Message-ID: <4E8A8795.3010408@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/10/11 17:29, Wannes Van Causbroeck wrote: > I guess the non-sequential core numbering is correct? Numbering is system (and even firmware) dependent, that's why hwloc generates logical numbering (the L# ones) to provide a consistent view of the hardware. Another reason I nag the Torque developers about folding in hwloc support into Torque.. 
;-)

cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6Kh5UACgkQO2KABBYQAh8YfwCfYXRfaZ5FUR+efsgNfLgm83lI
H9sAn2pNqOSqN50xdxkt1DJvPhxcDlq3
=l4Xa
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Mon Oct 3 22:15:24 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 04 Oct 2011 15:15:24 +1100
Subject: [torqueusers] limit the number of jobs a user can submit
In-Reply-To: <20110930184353.GF21971@stikine.sfu.ca>
References: <20110930184353.GF21971@stikine.sfu.ca>
Message-ID: <4E8A885C.6090201@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/10/11 04:43, Martin Siegert wrote:

> I have not yet found a way of preventing this DoS attack.
> The most logical solution to me would be to expand the "max_user_queuable"
> specification to allow for a server wide setting, not just a per
> queue setting, i.e.,

We just set that for both our queues. Works well.
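
For readers following along, here is a minimal sketch of what setting the per-queue limit on every queue might look like. The queue names (`batch`, `long`), the limit of 50, and the `DRY_RUN` switch are illustrative assumptions and do not come from this thread; only the `max_user_queuable` attribute itself is named above.

```shell
#!/bin/sh
# Sketch only: apply the same per-user queued-job cap to several queues.
# QUEUES and LIMIT are placeholder values, not taken from the thread.
QUEUES="batch long"
LIMIT=50

# Emit one qmgr directive per queue.
build_cmds() {
    for q in $QUEUES; do
        echo "set queue $q max_user_queuable = $LIMIT"
    done
}

# By default only print the commands; set DRY_RUN=0 to run them via qmgr.
build_cmds | while read -r cmd; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "qmgr -c \"$cmd\""
    else
        qmgr -c "$cmd"
    fi
done
```

Because the limit is applied per queue, a user can still queue up to LIMIT jobs in each queue, which is the gap the quoted message is asking about.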
cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6KiFwACgkQO2KABBYQAh/ouACghKCjj8p5jaNL/V9U2igC4BYU
uhAAnRmzphEqnF5/M+3KlOWftbFM/E57
=6G3R
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Mon Oct 3 22:17:45 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 04 Oct 2011 15:17:45 +1100
Subject: [torqueusers] Jobs in queue do not get started
In-Reply-To:
References:
Message-ID: <4E8A88E9.9010105@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/10/11 00:27, vlad at cosy.sbg.ac.at wrote:

> As I already wrote in previous mails we have
> Torque 3.0.3 snapshot and Maui version 3.3.1 running.

If there's no evidence in the server_logs of the job trying
to be started and failing then my wild guess would be that
this is a Maui issue, have you tried the mauiusers mailing
list ?
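
A rough sketch of the server_logs check described above, for anyone who wants to script it. The log directory default and the example job id are assumptions (the directory depends on how TORQUE was installed), and `job_log_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: look for any trace of a job in pbs_server's log files.
# LOGDIR is an assumed default location; adjust it for the local install.
LOGDIR="${LOGDIR:-/var/spool/torque/server_logs}"

# Print every log line in directory $2 that mentions job id $1.
# -F treats the job id as a fixed string, so its dots are not regex wildcards.
job_log_lines() {
    grep -hF -- "$1" "$2"/* 2>/dev/null
}

JOBID="${1:-1234.headnode}"   # placeholder job id
if job_log_lines "$JOBID" "$LOGDIR" | grep -q .; then
    job_log_lines "$JOBID" "$LOGDIR"
else
    echo "no server_logs entries mention $JOBID"
fi
```

If nothing is printed for a job that is sitting in the queue, the server was probably never asked to run it, which points at the scheduler rather than at pbs_server.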
cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6KiOkACgkQO2KABBYQAh8dcwCeKZWJE7pQnjLcygt5PB8KJYr9 dtQAnRQQCu4pX1x6gBSZjCGby7suNxZi =uB51 -----END PGP SIGNATURE----- From wannes.van.causbroeck at imdc.be Tue Oct 4 02:24:09 2011 From: wannes.van.causbroeck at imdc.be (Wannes Van Causbroeck) Date: Tue, 4 Oct 2011 10:24:09 +0200 Subject: [torqueusers] numa problems In-Reply-To: <4E8A8795.3010408@unimelb.edu.au> References: <6fe1606a-e98b-4d30-9093-8e2d1d0a3bad@mail> <9EBFECC459B64D448BC1AF3A7CFAB40588CE85@imdc-mail.imdc.local> <4E8A8795.3010408@unimelb.edu.au> Message-ID: <9EBFECC459B64D448BC1AF3A7CFAB40588CE9E@imdc-mail.imdc.local> As soon as the machine is idle i'll try messing around with some bios options. It's a dell poweredge R815, nothing that fancy. I'll try harassing dell technical support as well. On 10/04/2011 06:12 AM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 03/10/11 17:29, Wannes Van Causbroeck wrote: > >> I guess the non-sequential core numbering is correct? > Numbering is system (and even firmware) dependent, that's > why hwloc generates logical numbering (the L# ones) to > provide a consistent view of the hardware. > > Another reason I nag the Torque developers about folding in > hwloc support into Torque.. 
;-) > > cheers, > Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk6Kh5UACgkQO2KABBYQAh8YfwCfYXRfaZ5FUR+efsgNfLgm83lI > H9sAn2pNqOSqN50xdxkt1DJvPhxcDlq3 > =l4Xa > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From blake.wickliffe at aramco.com Tue Oct 4 06:42:04 2011 From: blake.wickliffe at aramco.com (Wickliffe, Blake W) Date: Tue, 4 Oct 2011 12:42:04 +0000 Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: References: <7ea30c15-1dcc-439c-8232-ebd460d5159c@mail> Message-ID: <09D3C16B37878C44837F749DB16ACF1908FBC5@DHA00730-MSXP03.aramco.com> Hi, Is there any feedback from Adaptive on this? It is still biting us on a fairly regular basis... Regards, Blake Wickliffe Saudi Aramco ENOD/CSYS/USG HPC Team (873-4417) -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Charles Johnson Sent: Tuesday, September 13, 2011 8:55 PM To: Torque Users Mailing List Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+ On Sep 13, 2011, at 10:08 AM, Ken Nielson wrote: > I just looked at Simon's patch and it is in the right direction. If the problem you have is that the tcp calls do not return promptly then you can be waiting a long time (10800 seconds by default). > > Regards > > Ken Nielson > Adaptive Computing FYI, we had a stoppage in the late morning, and the last 6 minutes of the torque log file was sent. I will have a look at the patch. 
~Charles~ -- Charles Johnson, Vanderbilt University Advanced Computing Center for Research & Education _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From dbeer at adaptivecomputing.com Tue Oct 4 09:13:11 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 04 Oct 2011 09:13:11 -0600 (MDT) Subject: [torqueusers] numa problems In-Reply-To: <4E8A8795.3010408@unimelb.edu.au> Message-ID: <7fb81539-f31c-4069-9aac-886612cc9cf5@mail> ----- Original Message ----- > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 03/10/11 17:29, Wannes Van Causbroeck wrote: > > > I guess the non-sequential core numbering is correct? > > Numbering is system (and even firmware) dependent, that's > why hwloc generates logical numbering (the L# ones) to > provide a consistent view of the hardware. > > Another reason I nag the Torque developers about folding in > hwloc support into Torque.. ;-) > Hwloc is now in TORQUE, but only for the 4.0 code.
Once we put it in, everyone that wants to use cpusets has to use hwloc, and so it was decided to put it in the 4.0 code.

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1656 S. East Bay Blvd. Suite #300
Provo, UT 84606

From knielson at adaptivecomputing.com Tue Oct 4 09:23:00 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 04 Oct 2011 09:23:00 -0600 (MDT)
Subject: [torqueusers] Strange problem in Torque 2.5.7+
In-Reply-To: <09D3C16B37878C44837F749DB16ACF1908FBC5@DHA00730-MSXP03.aramco.com>
Message-ID: <09da36e7-62c3-4e3a-9658-c04c4554b8e4@mail>

----- Original Message -----
> From: "Blake W Wickliffe"
> To: "Torque Users Mailing List"
> Sent: Tuesday, October 4, 2011 6:42:04 AM
> Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+
>
> Hi,
>
> Is there any feedback from Adaptive on this? It is still biting us
> on a fairly regular basis...
>
> Regards,
>
> Blake Wickliffe
> Saudi Aramco
> ENOD/CSYS/USG HPC Team
> (873-4417)
>

Blake,

Nothing yet. Sorry that this has been put off. I will try and get back to it this week.

Ken

From samuel at unimelb.edu.au Tue Oct 4 22:03:30 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 05 Oct 2011 15:03:30 +1100
Subject: [torqueusers] numa problems
In-Reply-To: <7fb81539-f31c-4069-9aac-886612cc9cf5@mail>
References: <7fb81539-f31c-4069-9aac-886612cc9cf5@mail>
Message-ID: <4E8BD712.8020600@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/10/11 02:13, David Beer wrote:

> Hwloc is now in TORQUE, but only for the 4.0 code.

W00t! I completely missed that commit to trunk.. :-)

> Once we put it in, everyone that wants to use cpusets
> has to use hwloc, and so it was decided to put it in
> the 4.0 code.

Is there any chance it can be used to do control group stuff
as well so you can limit a job on a node to just a particular
amount of RAM without worrying about the whole RLIMIT_AS vs
RLIMIT_DATA thing ?
I *think* Slurm is using it for that already.

cheers!
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6L1xIACgkQO2KABBYQAh/SHQCfegypk7P5oKOndmBN2fHret0X
c38AmwevhOo14Duh0gZ2zeAqOiG0Dq8H
=+/r1
-----END PGP SIGNATURE-----

From j.blank at fz-juelich.de Wed Oct 5 06:09:36 2011
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Wed, 05 Oct 2011 14:09:36 +0200
Subject: [torqueusers] Cpusets and qrun
Message-ID:

Hello,

We run 2.5.8 with cpusets enabled. When jobs with large memory requests (which does not allow to fill all virtual processors) are launched on a node the jobs are bound to the lower core numbers. This does not play well with the NUMA architecture.

Is there a way to specify the virtual processor the job should run on when using the qrun command?

Regards,
Jörg Blank

From leggett at mcs.anl.gov Wed Oct 5 07:58:31 2011
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Wed, 5 Oct 2011 08:58:31 -0500
Subject: [torqueusers] 2.5.8 PAM module fails
Message-ID: <38ECF3CC-9354-4548-8215-1E4997E7EE49@mcs.anl.gov>

I tried upgrading to 2.5.8 from 2.5.7 yesterday and ran into a problem with pam_pbssimpleauth:

Oct 4 15:05:25 c23 sshd[16909]: PAM unable to dlopen(/lib64/security/pam_pbssimpleauth.so)
Oct 4 15:05:25 c23 sshd[16909]: PAM [error: /lib64/security/pam_pbssimpleauth.so: undefined symbol: getpwnam_ext]
Oct 4 15:05:25 c23 sshd[16909]: PAM adding faulty module: /lib64/security/pam_pbssimpleauth.so

I built 2.5.8 the same way I built 2.5.7:

    ./configure --host=x86_64-redhat-linux-gnu --build=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --disable-gcc-warnings --with-pam=/lib64/security --enable-cpuset --enable-geometry-requests --enable-blcr --enable-nvidia-gpus

Rolling back to 2.5.7 fixed the problem.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 163 bytes
Desc: Message signed with OpenPGP using GPGMail
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111005/19f0f129/attachment.bin

From j.blank at fz-juelich.de Wed Oct 5 08:28:30 2011
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Wed, 05 Oct 2011 16:28:30 +0200
Subject: [torqueusers] 2.5.8 PAM module fails
In-Reply-To: <38ECF3CC-9354-4548-8215-1E4997E7EE49@mcs.anl.gov>
References: <38ECF3CC-9354-4548-8215-1E4997E7EE49@mcs.anl.gov>
Message-ID:

On 05.10.2011 15:58, Ti Leggett wrote:

> I built 2.5.8 the same way I built 2.5.7:

Interesting that it even built. I built 2.5.8 on debian stable and had to patch away the _ext. I do not know if it works though, as I do not load the module.

Regards,
Jörg Blank

From bdandrus at nps.edu Wed Oct 5 13:05:21 2011
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Wed, 5 Oct 2011 19:05:21 +0000
Subject: [torqueusers] job dependent upon array stuck in UserHold state
Message-ID:

All,

We are trying to run a job after an array job completes successfully.

    jid=$(echo 'echo $PBS_ARRAYID;sleep 60'|qsub -l walltime=00:01:00 -l procs=2 -t 0-10);echo 'hostname'|qsub -l walltime=00:01:00 -l procs=2 -W depend=afterokarray:$jid[]

The array job completes, but the dependent job ends up in a 'UserHold' state.
Shouldn't the dependent job move from the UserHold state to no hold and eligible once the array jobs complete??

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111005/587d2787/attachment.html

From ferraria at gmail.com Thu Oct 6 05:51:05 2011
From: ferraria at gmail.com (Anthony Ferrari)
Date: Thu, 6 Oct 2011 13:51:05 +0200
Subject: [torqueusers] Upgrading to a new Torque release - Maui compatibility
Message-ID:

Hi all,

We have a linux computing cluster of 64 nodes and we use Torque 2.4.6 with Maui 3.3. We would like to upgrade Torque to a version 2.5.x to benefit from some of the useful features it has compared to the 2.4.x. But we have been told that combining different versions of Torque with different versions of Maui is not always a success. The couple (Torque 2.4.6, Maui 3.3) seems to work well. So we are afraid to upgrade Torque to a 2.5.x version and then face some bugs.

Do you have any useful thoughts on this ? In your own systems, do you have a Torque2.5.x/Maui combination that you are happy with ? What combination do you advise ?

Many thanks,

Anthony
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111006/8de9cb57/attachment.html

From billy.lenox at us.army.mil Thu Oct 6 08:55:47 2011
From: billy.lenox at us.army.mil (Lenox, Billy AMRDEC/Sentient Corp.)
Date: Thu, 06 Oct 2011 09:55:47 -0500
Subject: [torqueusers] Need help with NCPUS not working in QSUB
Message-ID:

I have torque setup on a head node system with 5 compute nodes
Two have 8 cores and 3 have 4 cores setup into on queue called batch
When I use a submit script

#!/bin/bash
#PBS -l ncpus=28
#PBS -l walltime=72:00:00
#PBS -o output.out
#PBS -e ie.error

Here /var/spool/torque/server_priv/nodes

seed001 np=8 batch
seed002 np=8 batch
seed003 np=8 batch
seed004 np=8 batch
seed005 np=8 batch

When I submit the script it only runs on one node SEED001

I don't know why it only runs on one node.

Billy

From billy.lenox at us.army.mil Thu Oct 6 10:01:22 2011
From: billy.lenox at us.army.mil (Lenox, Billy AMRDEC/Sentient Corp.)
Date: Thu, 06 Oct 2011 11:01:22 -0500
Subject: [torqueusers] Need help with NCPUS not working in QSUB
In-Reply-To:
Message-ID:

Mistake below:

This is correct:

Here /var/spool/torque/server_priv/nodes

seed001 np=8 batch
seed002 np=8 batch
seed003 np=4 batch
seed004 np=4 batch
seed005 np=4 batch

> From: "Billy D. Lenox"
> Reply-To: Torque Users Mailing List
> Date: Thu, 06 Oct 2011 09:55:47 -0500
> To:
> Subject: [torqueusers] Need help with NCPUS not working in QSUB
>
> I have torque setup on a head node system with 5 compute nodes
> Two have 8 cores and 3 have 4 cores setup into on queue called batch
> When I use a submit script
>
> #!/bin/bash
> #PBS -l ncpus=28
> #PBS -l walltime=72:00:00
> #PBS -o output.out
> #PBS -e ie.error
>
> Here /var/spool/torque/server_priv/nodes
>
> seed001 np=8 batch
> seed002 np=8 batch
> seed003 np=8 batch
> seed004 np=8 batch
> seed005 np=8 batch
>
> When I submit the script it only runs on one node SEED001
>
> I don't know why it only runs on one node.
>
> Billy
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From tbaer at utk.edu Thu Oct 6 10:07:45 2011
From: tbaer at utk.edu (Troy Baer)
Date: Thu, 6 Oct 2011 12:07:45 -0400
Subject: [torqueusers] Need help with NCPUS not working in QSUB
In-Reply-To:
References:
Message-ID: <1317917265.6010.17.camel@browncoat.jics.utk.edu>

On Thu, 2011-10-06 at 09:55 -0500, Lenox, Billy AMRDEC/Sentient Corp.
wrote: > I have torque setup on a head node system with 5 compute nodes > Two have 8 cores and 3 have 4 cores setup into on queue called batch > When I use a submit script > > #!/bin/bash > #PBS -l ncpus=28 > #PBS -l walltime=72:00:00 > #PBS -o output.out > #PBS -e ie.error > > Here /var/spool/torque/server_priv/nodes > > seed001 np=8 batch > seed002 np=8 batch > seed003 np=8 batch > seed004 np=8 batch > seed005 np=8 batch > > When I submit the script it only runs on one node SEED001 > > I don't know why it only runs on one node. Which scheduler are you using? In most of the TORQUE-compatible schedulers I've seen, the ncpus= resource is interpreted as how many processors you want on a single shared memory system. (If you want X processors and you don't care where they are, I think the preferred way of requesting it is procs=X.) --Troy -- Troy Baer, HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From knielson at adaptivecomputing.com Thu Oct 6 10:37:33 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 06 Oct 2011 10:37:33 -0600 (MDT) Subject: [torqueusers] job dependent upon array stuck in UserHold state In-Reply-To: Message-ID: ----- Original Message ----- > From: "Brian Contractor Andrus" > To: "Torque Users Mailing List (torqueusers at supercluster.org)" > Sent: Wednesday, October 5, 2011 1:05:21 PM > Subject: [torqueusers] job dependent upon array stuck in UserHold state > > > > > > All, > > > > We are trying to run a job after an array job completes successfully. > > > > jid=$(echo 'echo $PBS_ARRAYID;sleep 60'|qsub -l walltime=00:01:00 -l > procs=2 -t 0-10);echo 'hostname'|qsub -l walltime=00:01:00 -l > procs=2 -W depend=afterokarray:$jid[] > > > > The array job completes, but the dependent job ends up in a > ?UserHold? state. > > Shouldn?t the dependent job move from the UserHold state to no hold > and eligible once the array jobs complete?? 
> > > > > > Brian Andrus > > ITACS/Research Computing > > Naval Postgraduate School > > Monterey, California > > voice: 831-656-6238 > > Brian, I am able to reproduce this. Will you make a bugzilla ticket for this and we will fix it before 2.5.9. Ken From bdandrus at nps.edu Thu Oct 6 10:38:40 2011 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Thu, 6 Oct 2011 16:38:40 +0000 Subject: [torqueusers] Clean up nodes showing jobs Message-ID: All, I have been unable with torque 2.5.8 to get the nodes to release non-existent jobs. When I run pbsnodes, I see MANY jobs on MANY nodes that are not actually there. I have tried momctl to clear them, pbs_mom purge, restarting pbs_mom and pbs_server to no avail. Has anyone seen this before? If so, how did it get cleaned up? No jobs will start because pbs_server thinks all the resources are in use. Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111006/374dcd2a/attachment.html From bdandrus at nps.edu Thu Oct 6 10:41:58 2011 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Thu, 6 Oct 2011 16:41:58 +0000 Subject: [torqueusers] job dependent upon array stuck in UserHold state In-Reply-To: References: Message-ID: Done. 
Bug 160 - job dependent upon array stuck in UserHold state http://www.clusterresources.com/bugzilla/show_bug.cgi?id=160 Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Thursday, October 06, 2011 9:38 AM To: Torque Users Mailing List Subject: Re: [torqueusers] job dependent upon array stuck in UserHold state ----- Original Message ----- > From: "Brian Contractor Andrus" > To: "Torque Users Mailing List (torqueusers at supercluster.org)" > > Sent: Wednesday, October 5, 2011 1:05:21 PM > Subject: [torqueusers] job dependent upon array stuck in UserHold > state > > > > > > All, > > > > We are trying to run a job after an array job completes successfully. > > > > jid=$(echo 'echo $PBS_ARRAYID;sleep 60'|qsub -l walltime=00:01:00 -l > procs=2 -t 0-10);echo 'hostname'|qsub -l walltime=00:01:00 -l > procs=2 -W depend=afterokarray:$jid[] > > > > The array job completes, but the dependent job ends up in a ?UserHold? > state. > > Shouldn?t the dependent job move from the UserHold state to no hold > and eligible once the array jobs complete?? > > > > > > Brian Andrus > > ITACS/Research Computing > > Naval Postgraduate School > > Monterey, California > > voice: 831-656-6238 > > Brian, I am able to reproduce this. Will you make a bugzilla ticket for this and we will fix it before 2.5.9. Ken _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From billy.lenox at us.army.mil Thu Oct 6 11:09:36 2011 From: billy.lenox at us.army.mil (Lenox, Billy AMRDEC/Sentient Corp.) 
Date: Thu, 06 Oct 2011 12:09:36 -0500 Subject: [torqueusers] Need help with NCPUS not working in QSUB In-Reply-To: <1317917265.6010.17.camel@browncoat.jics.utk.edu> Message-ID: Ok I tried PBS -l procs=28 and it still runs on one NODE seed001 I notice that if I put in the script on the EXEC line the location of a HOSTFILE it runs and bypasses TORQUE PBS. I just have the Default Scheduler on the System. I know I can not specify PBS -l nodes=5. I have tried different ways and still it will only run on ONE NODE seed001. Billy > From: Troy Baer > Organization: National Institute for Computational Sciences, University of > Tennessee > Reply-To: Torque Users Mailing List > Date: Thu, 6 Oct 2011 12:07:45 -0400 > To: Torque Users Mailing List > Subject: Re: [torqueusers] Need help with NCPUS not working in QSUB > > On Thu, 2011-10-06 at 09:55 -0500, Lenox, Billy AMRDEC/Sentient Corp. > wrote: >> I have torque setup on a head node system with 5 compute nodes >> Two have 8 cores and 3 have 4 cores setup into on queue called batch >> When I use a submit script >> >> #!/bin/bash >> #PBS -l ncpus=28 >> #PBS -l walltime=72:00:00 >> #PBS -o output.out >> #PBS -e ie.error >> >> Here /var/spool/torque/server_priv/nodes >> >> seed001 np=8 batch >> seed002 np=8 batch >> seed003 np=8 batch >> seed004 np=8 batch >> seed005 np=8 batch >> >> When I submit the script it only runs on one node SEED001 >> >> I don't know why it only runs on one node. > > Which scheduler are you using? In most of the TORQUE-compatible > schedulers I've seen, the ncpus= resource is interpreted as how many > processors you want on a single shared memory system. (If you want X > processors and you don't care where they are, I think the preferred way > of requesting it is procs=X.) 
>
> --Troy
> --
> Troy Baer, HPC System Administrator
> National Institute for Computational Sciences, University of Tennessee
> http://www.nics.tennessee.edu/
> Phone: 865-241-4233
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From jjc at iastate.edu Thu Oct 6 14:33:01 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 6 Oct 2011 15:33:01 -0500
Subject: [torqueusers] Need help with NCPUS not working in QSUB
In-Reply-To:
References: <1317917265.6010.17.camel@browncoat.jics.utk.edu>
Message-ID:

Torque and PBS give you a file whose path is stored in the environment
variable PBS_NODEFILE.

For example, with MPICH you could use

    mpirun -np 28 -machinefile ${PBS_NODEFILE} ./prog

Then 28 copies of ./prog will be started on
the 28 machines listed in ${PBS_NODEFILE}.

Other programs like Fluent need you to specify something like:

    fluent 3ddp -t28 -pib -g -i Case.jou -cnf=${PBS_NODEFILE}

Again, here you need to specify a file containing the
machines on which to run each process. If you leave off the
-cnf above, fluent will start all the processes on
the first node that the job was assigned to.

-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Lenox, Billy AMRDEC/Sentient
>Corp.
>Sent: Thursday, October 06, 2011 12:10 PM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] Need help with NCPUS not working in QSUB
>
>Ok I tried PBS -l procs=28 and it still runs on one NODE seed001
>I notice that if I put in the script on the EXEC line the location
>of a
>HOSTFILE it runs and bypasses TORQUE PBS. I just have the Default
>Scheduler
>on the System. I know I can not specify PBS -l nodes=5.
>I have tried different ways and still it will only run on ONE NODE
>seed001.
> >Billy > >> From: Troy Baer >> Organization: National Institute for Computational Sciences, >University of >> Tennessee >> Reply-To: Torque Users Mailing List >> Date: Thu, 6 Oct 2011 12:07:45 -0400 >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] Need help with NCPUS not working in >QSUB >> >> On Thu, 2011-10-06 at 09:55 -0500, Lenox, Billy AMRDEC/Sentient >Corp. >> wrote: >>> I have torque setup on a head node system with 5 compute nodes >>> Two have 8 cores and 3 have 4 cores setup into on queue called >batch >>> When I use a submit script >>> >>> #!/bin/bash >>> #PBS -l ncpus=28 >>> #PBS -l walltime=72:00:00 >>> #PBS -o output.out >>> #PBS -e ie.error >>> >>> Here /var/spool/torque/server_priv/nodes >>> >>> seed001 np=8 batch >>> seed002 np=8 batch >>> seed003 np=8 batch >>> seed004 np=8 batch >>> seed005 np=8 batch >>> >>> When I submit the script it only runs on one node SEED001 >>> >>> I don't know why it only runs on one node. >> >> Which scheduler are you using? In most of the TORQUE-compatible >> schedulers I've seen, the ncpus= resource is interpreted as how >many >> processors you want on a single shared memory system. (If you >want X >> processors and you don't care where they are, I think the >preferred way >> of requesting it is procs=X.) 
>> >> --Troy >> -- >> Troy Baer, HPC System Administrator >> National Institute for Computational Sciences, University of >Tennessee >> http://www.nics.tennessee.edu/ >> Phone: 865-241-4233 >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From jdsmit at sandia.gov Thu Oct 6 14:49:54 2011 From: jdsmit at sandia.gov (Jerry Smith) Date: Thu, 06 Oct 2011 14:49:54 -0600 Subject: [torqueusers] Need help with NCPUS not working in QSUB In-Reply-To: References: <1317917265.6010.17.camel@browncoat.jics.utk.edu> Message-ID: <4E8E1472.4060001@sandia.gov> If you have built mpiexec: http://www.osc.edu/~djohnson/mpiexec/index.php It is aware of $PBS_NODEFILE, and will do "the right thing", when used similarly to mpirun as mentioned by Mr. Croyle. Jerry Coyle, James J [ITACD] wrote: > Torque and PBS give you a file named > PBS_NODEFILE > > For example with MPIPCH you could use > > mpirun -np 28 -machinefile ${PBS_NODEFILE} ./prog > > Then 28 copies of ./prog will be started on > the 28 machines listed in ${PBS_NODEFILE} > > Other programs like Fluent need you to specify something like: > fluent 3ddp -t28 -pib -g -i Case.jou -cnf=${PBS_NODEFILE} > > > again here you need to specify a file containing the > machines on which to run each process. If you leave off the > -cnf above, fluent will start all the processes on > the first node that the jobs got assigned to. > > > -----Original Message----- > >> From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >> bounces at supercluster.org] On Behalf Of Lenox, Billy AMRDEC/Sentient >> Corp. 
>> Sent: Thursday, October 06, 2011 12:10 PM >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] Need help with NCPUS not working in QSUB >> >> Ok I tried PBS -l procs=28 and it still runs on one NODE seed001 >> I notice that if I put in the script on the EXEC line the location >> of a >> HOSTFILE it runs and bypasses TORQUE PBS. I just have the Default >> Scheduler >> on the System. I know I can not specify PBS -l nodes=5. >> I have tried different ways and still it will only run on ONE NODE >> seed001. >> >> Billy >> >> >>> From: Troy Baer >>> Organization: National Institute for Computational Sciences, >>> >> University of >> >>> Tennessee >>> Reply-To: Torque Users Mailing List >>> Date: Thu, 6 Oct 2011 12:07:45 -0400 >>> To: Torque Users Mailing List >>> Subject: Re: [torqueusers] Need help with NCPUS not working in >>> >> QSUB >> >>> On Thu, 2011-10-06 at 09:55 -0500, Lenox, Billy AMRDEC/Sentient >>> >> Corp. >> >>> wrote: >>> >>>> I have torque setup on a head node system with 5 compute nodes >>>> Two have 8 cores and 3 have 4 cores setup into on queue called >>>> >> batch >> >>>> When I use a submit script >>>> >>>> #!/bin/bash >>>> #PBS -l ncpus=28 >>>> #PBS -l walltime=72:00:00 >>>> #PBS -o output.out >>>> #PBS -e ie.error >>>> >>>> Here /var/spool/torque/server_priv/nodes >>>> >>>> seed001 np=8 batch >>>> seed002 np=8 batch >>>> seed003 np=8 batch >>>> seed004 np=8 batch >>>> seed005 np=8 batch >>>> >>>> When I submit the script it only runs on one node SEED001 >>>> >>>> I don't know why it only runs on one node. >>>> >>> Which scheduler are you using? In most of the TORQUE-compatible >>> schedulers I've seen, the ncpus= resource is interpreted as how >>> >> many >> >>> processors you want on a single shared memory system. (If you >>> >> want X >> >>> processors and you don't care where they are, I think the >>> >> preferred way >> >>> of requesting it is procs=X.) 
>>> >>> --Troy >>> -- >>> Troy Baer, HPC System Administrator >>> National Institute for Computational Sciences, University of >>> >> Tennessee >> >>> http://www.nics.tennessee.edu/ >>> Phone: 865-241-4233 >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111006/e68cd9d9/attachment.html From knielson at adaptivecomputing.com Thu Oct 6 15:43:01 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 06 Oct 2011 15:43:01 -0600 (MDT) Subject: [torqueusers] 2.5.8 PAM module fails In-Reply-To: <38ECF3CC-9354-4548-8215-1E4997E7EE49@mcs.anl.gov> Message-ID: <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> ----- Original Message ----- > From: "Ti Leggett" > To: "Torque Users Mailing List" > Sent: Wednesday, October 5, 2011 7:58:31 AM > Subject: [torqueusers] 2.5.8 PAM module fails > > I tried upgrading to 2.5.8 from 2.5.7 yesterday and ran into a > problem with pam_pbssimpleauth: > > Oct 4 15:05:25 c23 sshd[16909]: PAM unable to > dlopen(/lib64/security/pam_pbssimpleauth.so) > Oct 4 15:05:25 c23 sshd[16909]: PAM [error: > /lib64/security/pam_pbssimpleauth.so: undefined symbol: > getpwnam_ext] > Oct 4 15:05:25 c23 sshd[16909]: PAM adding faulty module: > /lib64/security/pam_pbssimpleauth.so > > I built 2.5.8 the same way I built 2.5.7: > > ./configure --host=x86_64-redhat-linux-gnu > --build=x86_64-redhat-linux-gnu 
--target=x86_64-redhat-linux > --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin > --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share > --includedir=/usr/include --libdir=/usr/lib64 > --libexecdir=/usr/libexec --localstatedir=/var > --sharedstatedir=/usr/com --mandir=/usr/share/man > --infodir=/usr/share/info --disable-gcc-warnings > --with-pam=/lib64/security --enable-cpuset > --enable-geometry-requests --enable-blcr --enable-nvidia-gpus > > Rolling back to 2.5.7 fixed the problem. I have tried this on CentOS 5 and Ubuntu 11.04 without any problem. I am building from the 2.5-fixes branch in Subversion. Is anyone else having an issue compiling PAM on TORQUE with 2.5.8? Maybe another good question is "is anyone using PAM with 2.5.8"? Regards Ken Nielson Adaptive Computing From mej at lbl.gov Thu Oct 6 16:12:34 2011 From: mej at lbl.gov (Michael Jennings) Date: Thu, 6 Oct 2011 15:12:34 -0700 Subject: [torqueusers] 2.5.8 PAM module fails In-Reply-To: <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> References: <38ECF3CC-9354-4548-8215-1E4997E7EE49@mcs.anl.gov> <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> Message-ID: On Oct 6, 2011 2:43 PM, "Ken Nielson" wrote: > I have tried this on CentOS 5 and Ubuntu 11.04 without any problem. I am building from the 2.5-fixes branch in Subversion. Is anyone else having an issue compiling PAM on TORQUE with 2.5.8? > Maybe another good question is "is anyone using PAM with 2.5.8"? We have successfully built 2.5.8 with PAM support on CentOS 4 and 5 (and SL6) with no problems. Michael -- Michael Jennings Linux Systems and Cluster Engineer High-Performance Computing Services Bldg 50B-3209E. W: 510-495-2687 MS 050C-3396 F: 510-486-8615 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111006/84271398/attachment-0001.html From wcurry at tulane.edu Thu Oct 6 16:11:45 2011 From: wcurry at tulane.edu (Curry, William D) Date: Thu, 6 Oct 2011 22:11:45 +0000 Subject: [torqueusers] cpuset.c sleep timer causing job deferrals with 'RM Failure' response Message-ID: Dear list, I've had trouble with jobs deferring when assigned to nodes following finished larger (many ppn) jobs. `checkjob` reports an "RM Failure". Usually after waiting a minute or two `releasehold` allows the job to run. I noticed in the mom's log that a ppn=32 job spent 32 seconds deleting task directories in /dev/cpuset after the job finished. Inside src/resmom/linux/cpuset.c there is a 'sleep(1)' command along with a FIXME note that is apparently responsible for this behavior. After removing this line and restarting the mom, my tests no longer end with deferred jobs. What are the potential pitfalls of doing this? Sincerely, -Will -- Will Curry Sr. HPC Systems Analyst Center for Computational Science Tulane University office: (504) 862-8393 -- From leggett at mcs.anl.gov Thu Oct 6 21:32:05 2011 From: leggett at mcs.anl.gov (Ti Leggett) Date: Thu, 6 Oct 2011 22:32:05 -0500 Subject: [torqueusers] 2.5.8 PAM module fails In-Reply-To: <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> References: <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> Message-ID: <97DF6624-B67C-4EE5-9E7F-4784D4B77F67@mcs.anl.gov> I was able to build it successfully, but it bails when PAM tries to load it to use it. 
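A build that succeeds but fails at dlopen() time with "undefined symbol" usually means the module refers to a symbol that none of the libraries it is linked against provides at load time. A minimal sketch of how one might narrow that down, assuming binutils (nm) and ldd are available on the node; the helper name has_undefined_symbol is hypothetical and only filters nm-style output:

```shell
#!/bin/bash
# Hypothetical helper: reads `nm -D` style output on stdin and succeeds (exit 0)
# if the named symbol is listed as undefined ("U"), i.e. the module expects
# some other shared library to provide it.
has_undefined_symbol() {
    awk -v sym="$1" '
        $1 == "U" {
            name = $2
            sub(/@.*/, "", name)   # drop any @VERSION suffix
            if (name == sym) found = 1
        }
        END { exit !found }
    '
}

# Typical use against the module named in the sshd log
# (commented out so the sketch has no side effects):
#   nm -D /lib64/security/pam_pbssimpleauth.so | has_undefined_symbol getpwnam_ext \
#       && echo "module expects getpwnam_ext from another library"
#   ldd -r /lib64/security/pam_pbssimpleauth.so   # reports symbols that stay unresolved
```

An "U" entry on its own is normal for any shared object; it only becomes a problem when `ldd -r` also reports the symbol as unresolved, which would point at a missing or mismatched library at link time.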
On Oct 6, 2011, at 4:43 PM, Ken Nielson wrote:

> I have tried this on CentOS 5 and Ubuntu 11.04 without any problem. I am building from the 2.5-fixes branch in Subversion. Is anyone else having an issue compiling PAM on TORQUE with 2.5.8?
>
> Maybe another good question is "is anyone using PAM with 2.5.8"?
>
> Regards
> Ken Nielson
> Adaptive Computing

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 163 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111006/630f7b23/attachment.bin From samuel at unimelb.edu.au Sun Oct 9 20:44:04 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 10 Oct 2011 13:44:04 +1100 Subject: [torqueusers] Mother Superior talking to herself - mom crash Message-ID: <4E925BF4.8050202@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi folks, On Saturday morning (Melbourne time) we had a node (bruce005) lose its pbs_mom with the final message in the logs being the rather cryptic: 10/08/2011 03:36:36;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::check_ms, Mother Superior talking to herself This is with Torque 2.4.16. I've never seen that before, so I was wondering if anyone else had ? The error itself is from check_ms() in src/resmom/mom_comm.c and has the comment: * Check to be sure this is a connection from Mother Superior on * a good port. * Check to make sure I am not Mother Superior (talking to myself). So it appears to be something that's known to occur (or at least be important enough to check that it doesn't happen). Here's the job info in case it helps. 
[root at bruce-m vlsci]# tracejob -q -n 5 881198 Job: 881198.bruce-m.vlsci.unimelb.edu.au 10/07/2011 14:45:56 S enqueuing into batch, state 1 hop 1 10/07/2011 14:45:56 S Job Queued at request of evan at bruce.vlsci.unimelb.edu.au, owner = evan at bruce.vlsci.unimelb.edu.au, job name = 212221212112212121121221112212211, queue = batch 10/07/2011 14:45:56 A queue=batch 10/08/2011 03:35:54 S Job Run at request of root at bruce-m.vlsci.unimelb.edu.au 10/08/2011 03:36:00 S unable to run job, MOM rejected/timeout 10/08/2011 03:36:03 S Job Run at request of root at bruce-m.vlsci.unimelb.edu.au 10/08/2011 03:36:03 A user=evan group=VR0062 jobname=212221212112212121121221112212211 queue=batch ctime=1317959156 qtime=1317959156 etime=1317959156 start=1318005363 owner=evan at bruce.vlsci.unimelb.edu.au exec_host=bruce005/6+bruce005/5+bruce005/4+bruce005/3+bruce005/1+bruce005/0+bruce006/5+bruce006/1+bruce008/1+bruce008/0+bruce011/0+bruce012/6+bruce012/5+bruce012/4+bruce012/3+bruce108/7 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.procs=16 Resource_List.pvmem=3gb Resource_List.walltime=08:00:00 10/08/2011 11:46:05 S Job deleted at request of root at bruce-m.vlsci.unimelb.edu.au 10/08/2011 11:46:05 S Job sent signal SIGTERM on delete 10/08/2011 11:46:05 S purging job without checking MOM 10/08/2011 11:46:05 S dequeuing from batch, state RUNNING 10/08/2011 11:46:05 A requestor=root at bruce-m.vlsci.unimelb.edu.au [root at bruce-m vlsci]# - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6SW/QACgkQO2KABBYQAh+G4ACeOtTzVeIor4Hg7OpWMS5v6IAJ ijAAnjvX7PKLHIpNkcOeUF14wOohMQwf =n6uA -----END PGP SIGNATURE----- From Gareth.Williams at csiro.au Mon Oct 10 04:07:50 
2011 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Mon, 10 Oct 2011 21:07:50 +1100 Subject: [torqueusers] procct and held jobs Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B956AE43@exvic-mbx04.nexus.csiro.au> Hi All, We recently updated torque from 3.0.2 to 3.0.3-snap.201108261653 and have found that at least in some cases, if we submit a job with a hold (with qsub -a to run after a given time) to a routing queue, when the job is released and moves to an execution queue it will still not run because moab 6.0.2 sees a procct GRES. qstat -f shows a procct resource only while the job is held and in the routing queue. Does anyone else with a recent torque version see this problem. You can test with: echo sleep 300 | qsub -a `date -d 'now + 5 minutes' +'%Y%m%d%H%M'` This should hold for 5 minutes then run and sleep for 5 minutes. Gareth For reference, I've worked around the issue by defining in moab a GLOBAL gres called procct with a large count. The same technique would probably work with maui From knielson at adaptivecomputing.com Mon Oct 10 06:45:29 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 10 Oct 2011 06:45:29 -0600 (MDT) Subject: [torqueusers] procct and held jobs In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102B956AE43@exvic-mbx04.nexus.csiro.au> Message-ID: <9ec845b1-4ea7-40ef-9e15-9e62ecf361c0@mail> ----- Original Message ----- > From: "Gareth Williams" > To: torqueusers at supercluster.org, moabusers at supercluster.org > Sent: Monday, October 10, 2011 4:07:50 AM > Subject: [torqueusers] procct and held jobs > > Hi All, > > We recently updated torque from 3.0.2 to 3.0.3-snap.201108261653 and > have found that at least in some cases, if we submit a job with a > hold (with qsub -a to run after a given time) to a routing queue, > when the job is released and moves to an execution queue it will > still not run because moab 6.0.2 sees a procct GRES. 
qstat -f shows > a procct resource only while the job is held and in the routing > queue. > > Does anyone else with a recent torque version see this problem. You > can test with: > echo sleep 300 | qsub -a `date -d 'now + 5 minutes' +'%Y%m%d%H%M'` > > This should hold for 5 minutes then run and sleep for 5 minutes. > > Gareth > > For reference, I've worked around the issue by defining in moab a > GLOBAL gres called procct with a large count. The same technique > would probably work with maui Gareth, That has been fixed in 2.5.8. I need to merge the fix with 3.0-fixes. I will get a snapshot when I do. Ken From knielson at adaptivecomputing.com Mon Oct 10 14:00:23 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 10 Oct 2011 14:00:23 -0600 (MDT) Subject: [torqueusers] procct and held jobs In-Reply-To: <9ec845b1-4ea7-40ef-9e15-9e62ecf361c0@mail> Message-ID: <3cbf2f38-ef0e-4ebf-af1e-8c41b1e3e92f@mail> ----- Original Message ----- > From: "Ken Nielson" > To: "Torque Users Mailing List" > Cc: moabusers at supercluster.org > Sent: Monday, October 10, 2011 6:45:29 AM > Subject: Re: [torqueusers] procct and held jobs > > > > ----- Original Message ----- > > From: "Gareth Williams" > > To: torqueusers at supercluster.org, moabusers at supercluster.org > > Sent: Monday, October 10, 2011 4:07:50 AM > > Subject: [torqueusers] procct and held jobs > > > > Hi All, > > > > We recently updated torque from 3.0.2 to 3.0.3-snap.201108261653 > > and > > have found that at least in some cases, if we submit a job with a > > hold (with qsub -a to run after a given time) to a routing queue, > > when the job is released and moves to an execution queue it will > > still not run because moab 6.0.2 sees a procct GRES. qstat -f shows > > a procct resource only while the job is held and in the routing > > queue. > > > > Does anyone else with a recent torque version see this problem. 
> > You can test with:
> > echo sleep 300 | qsub -a `date -d 'now + 5 minutes' +'%Y%m%d%H%M'`
> >
> > This should hold for 5 minutes then run and sleep for 5 minutes.
> >
> > Gareth
> >
> > For reference, I've worked around the issue by defining in moab a
> > GLOBAL gres called procct with a large count. The same technique
> > would probably work with maui
>
> Gareth,
>
> That has been fixed in 2.5.8. I need to merge the fix with 3.0-fixes.
> I will get a snapshot when I do.
>

I was wrong. This did make it into the snapshot. This is another case where procct is passed up to the scheduler. Another bug.

Ken

From WJEdsall at dow.com Wed Oct 12 09:16:09 2011
From: WJEdsall at dow.com (Edsall, William (WJ))
Date: Wed, 12 Oct 2011 11:16:09 -0400
Subject: [torqueusers] qstat -u missing data
Message-ID: <52CD990A674498429E6A7B4FCAE3F7D3077D9497@USMDLMDOWX025.dow.com>

Hello list,
What might cause qstat -u to have empty information for nodes and tasks?

Torque version: 2.4.12 - installed via the Torque roll in Rocks.

qstat -u user

cluster:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
3229.cluster         user     compute  STDIN             24972    --  --     --    -- R    --

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111012/d1c52d97/attachment-0001.html

From finklefred98 at yahoo.com Wed Oct 12 12:49:36 2011
From: finklefred98 at yahoo.com (Fred Finkle)
Date: Wed, 12 Oct 2011 11:49:36 -0700 (PDT)
Subject: [torqueusers] mutiple job array dependency bug
Message-ID: <1318445376.29687.YahooMailNeo@web121114.mail.ne1.yahoo.com>

Hey all -

I am using the job-array dependency functionality and I have found what I think is a repeatable bug in torque-3.0.0. I routinely submit a finalization job that depends on hundreds of jobs which are grouped into several job arrays.
The finalization job is started BEFORE all the jobs it depends on have finished, in certain circumstances with respect to the job array run states. I am using the qsub format "-W depend:afterokarray:1[]:2[]", which is working fine except for the following case:

If the finalization job depends on 2 job arrays finishing and array#1 is partially running (say 5 out of 10 are R, the other 5 are Q) and array#2 finishes completely, at that moment the finalization job is released from H, only to be reset to H since array#1 has not finished yet. Here is the server log showing the state transitions:

10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Clearing HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:26;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:26;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:26;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies

Is this expected? This tiny transition is messing our pipeline up since we don't support checkpointing and the job state gets all screwy from that point onward.

thx - Fred

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111012/f735c18d/attachment.html

From Gareth.Williams at csiro.au Wed Oct 12 16:53:54 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Thu, 13 Oct 2011 09:53:54 +1100
Subject: [torqueusers] qstat -u missing data
In-Reply-To: <52CD990A674498429E6A7B4FCAE3F7D3077D9497@USMDLMDOWX025.dow.com>
References: <52CD990A674498429E6A7B4FCAE3F7D3077D9497@USMDLMDOWX025.dow.com>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102BBE54770@exvic-mbx04.nexus.csiro.au>

> From: Edsall, William (WJ) [mailto:WJEdsall at dow.com]
-snip-
> What might cause qstat -u to have empty information for nodes and tasks?

If a job does not specify nodes=... and there is no explicit default (which I think is best), then it is effectively the same as requesting nodes=1 or nodes=1:ppn=1, but the job does not have the explicit resources listed, so qstat -a lists the fields with '--'. You can't very well have a default nodes spec if you want to also allow users to request procs.

Note that we are not running the same version, but I think that is not particularly important for this query.

I'll leave it at that and avoid expressing an opinion on whether the situation is good or bad.

Gareth

From JRS221 at bham.ac.uk Thu Oct 13 05:08:42 2011
From: JRS221 at bham.ac.uk (Jonathan Smale)
Date: Thu, 13 Oct 2011 11:08:42 +0000
Subject: [torqueusers] Assigning Nodes to specific nodes
Message-ID:

Dear Torque & Maui users,

I'm crossposting this to maui & torque users as I'm not sure which of these is posing a problem. I'm trying to make three separate queues for the three different types of nodes on our cluster using the combination of qmgr commands:

set queue firstgen resources_default.neednodes = 1stgennodes
set nodes compute-0-0 properties = 1stgennodes

I've found a fair few previous emails about this and have followed their solution without success.
I can submit the jobs to the queues but they remain in a queued state, qstat -f gives:- [root at che-hydra /]# qstat -f 48691 Job Id: 48691.che-hydra.bham.ac.uk Job_Name = allenega_p2000_g300_r2.cmd Job_Owner = jsmale at che-hydra.bham.ac.uk resources_used.cput = 529:06:32 resources_used.mem = 16872kb resources_used.vmem = 245908kb resources_used.walltime = 529:11:31 job_state = R queue = default server = che-hydra.bham.ac.uk Checkpoint = u ctime = Wed Sep 21 10:45:00 2011 Error_Path = che-hydra.bham.ac.uk:/home/jsmale/allenega/allenega_p2000_g30 0_r2.cmd.e48691 exec_host = compute-0-16/1 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Wed Sep 21 10:45:01 2011 Output_Path = che-hydra.bham.ac.uk:/home/jsmale/allenega/allenega_p2000_g3 00_r2.cmd.o48691 Priority = 0 qtime = Wed Sep 21 10:45:00 2011 Rerunable = True Resource_List.neednodes = 1 Resource_List.nodect = 1 Resource_List.nodes = 1 session_id = 20687 substate = 42 Variable_List = PBS_O_HOME=/home/jsmale,PBS_O_LANG=en_US.iso885915, PBS_O_LOGNAME=jsmale, PBS_O_PATH=/home/jsmale/mctdh90.svn/bin/x86_64:/home/jsmale/mctdh90.s vn/bin:/usr/lib64/openmpi/1.3.2-gcc/bin:/usr/kerberos/bin:/usr/java/la test/bin:/usr/local/bin:/bin:/usr/bin:/opt/maui/bin:/opt/torque/bin:/o pt/torque/sbin:/usr/share/pvm3/pvm3//bin/LINUX64:/opt/rocks/bin:/opt/r ocks/sbin:/global64/pgi/linux86-64/10.4/bin:/user/worth/gaussian/bin:/ user/jsmale/bin:/home/gaussian/bin:~/mctdh90.svn/bin:/home/jsmale/bin, PBS_O_MAIL=/var/spool/mail/jsmale,PBS_O_SHELL=/bin/bash, PBS_SERVER=che-hydra.bham.ac.uk,PBS_O_HOST=che-hydra.bham.ac.uk, PBS_O_WORKDIR=/home/jsmale/allenega,PBS_O_QUEUE=default euser = jsmale egroup = worth hashname = 48691.che-hydra.bham.ac.uk queue_rank = 35491 queue_type = E etime = Wed Sep 21 10:45:00 2011 submit_args = allenega_p2000_g300_r2.cmd start_time = Wed Sep 21 10:45:01 2011 start_count = 1 and checkjob gives the following: [root at che-hydra /]# checkjob 48691 checking job 48691 State: Running Creds: 
user:jsmale group:worth class:default qos:DEFAULT WallTime: 22:01:12:16 of 99:23:59:59 SubmitTime: Wed Sep 21 10:45:00 (Time Queued Total: 00:00:01 Eligible: 00:00:00) StartTime: Wed Sep 21 10:45:01 Total Tasks: 1 Req[0] TaskCount: 1 Partition: DEFAULT Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 Opsys: [NONE] Arch: [NONE] Features: [NONE] NodeCount: 1 Allocated Nodes: [compute-0-16:1] IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 1 PartitionMask: [ALL] Flags: RESTARTABLE Reservation '48691' ( -INFINITY -> 77:22:47:56 Duration: 99:23:59:59) PE: 1.00 StartPriority: 20 I'm not sure why the job isn't running, there doesn't seem to be any reason given in either the maui or torque (server&mom) logs. Could anyone help me decipher the cause? Configuration of the server follows. Top of maui.cfg file: # maui.cfg 3.2.6p20 SERVERHOST che-hydra.bham.ac.uk # primary admin must be first in list ADMIN1 root # Resource Manager Definition RMCFG[che-hydra.bham.ac.uk] TYPE=PBS # Allocation Manager Definition AMCFG[bank] TYPE=NONE # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html # use the 'schedctl -l' command to display current configuration RMPOLLINTERVAL 00:00:30 SERVERPORT 42559 SERVERMODE NORMAL # Admin: http://supercluster.org/mauidocs/a.esecurity.html LOGFILE maui.log LOGFILEMAXSIZE 100000000 LOGLEVEL 3 # Setting up node information for throttling policies # NODECFG[compute-0-0] SPEED=1 MAXJOB=4 nodetype=firstgennodes NODECFG[compute-0-1] SPEED=1 MAXJOB=4 nodetype=firstgennodes NODECFG[compute-0-2] SPEED=1 MAXJOB=4 nodetype=firstgennodes NODECFG[compute-0-3] SPEED=1 MAXJOB=4 nodetype=firstgennodes NODECFG[compute-0-4] SPEED=1.2 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-5] SPEED=1.2 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-6] SPEED=1.2 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-7] SPEED=1.2 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-8] SPEED=1.4 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-9] 
SPEED=1.4 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-10] SPEED=1.4 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-11] SPEED=1.5 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-12] SPEED=1.5 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-13] SPEED=1.5 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-14] SPEED=1.5 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-15] SPEED=1.5 MAXJOB=8 nodetype=secondgennodes NODECFG[compute-0-16] SPEED=1.7 MAXJOB=16 nodetype=thirdgennodes NODECFG[compute-0-17] SPEED=1.7 MAXJOB=16 nodetype=thirdgennodes NODECFG[compute-0-18] SPEED=1.7 MAXJOB=16 nodetype=thirdgennodes NODECFG[compute-0-19] SPEED=1.7 MAXJOB=16 nodetype=thirdgennodes # Setting up queue information to allow allocation to specfic types of nodes via queues CLASSCFG[firstgen] hostlist = compute-0-0,compute-0-1,compute-0-2,compute-0-3 CLASSCFG[secondgen] hostlist = compute-0-4,compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-9,compute-0-10,compute-0-11,compute-0-12,compute-0-13,compute-0-14,compute-0-15 CLASSCFG[thirdgen] hostlist = compute-0-16,compute-0-17,compute-0-18,compute-0-19 # Backfill: http://supercluster.org/mauidocs/8.2backfill.html BACKFILLPOLICY FIRSTFIT RESERVATIONPOLICY CURRENTHIGHEST # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html NODEALLOCATIONPOLICY CPULOAD Some torque setting that might be of use: [root at che-hydra]# pbsnodes (truncated, one example of each type of node compute-0-0 state = free np = 4 properties = firstgennodes ntype = cluster status = opsys=linux,uname=Linux compute-0-0.local 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64,sessions=? 15201,nsessions=? 
15201,nusers=0,idletime=13287377,totmem=9195716kb,availmem=8926420kb,physmem=8175600kb,ncpus=4,loadave=0.00,netload=54589282244,state=free,jobs=,varattr=,rectime=1318502240 compute-0-4 state = free np = 8 properties = secondgennodes ntype = cluster status = opsys=linux,uname=Linux compute-0-4.local 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64,sessions=? 15201,nsessions=? 15201,nusers=0,idletime=21840103,totmem=17464156kb,availmem=17170748kb,physmem=16444040kb,ncpus=8,loadave=0.00,netload=494140575539,state=free,jobs=,varattr=,rectime=1318502242 compute-0-16 state = free np = 16 properties = thirdgennodes ntype = cluster jobs = 0/48738.che-hydra.bham.ac.uk, 1/48691.che-hydra.bham.ac.uk, 3/48693.che-hydra.bham.ac.uk status = opsys=linux,uname=Linux compute-0-16.local 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64,sessions=6691 20687 20764,nsessions=3,nusers=2,idletime=7342084,totmem=17461096kb,availmem=8761384kb,physmem=16440980kb,ncpus=16,loadave=3.08,netload=647310098799,state=free,jobs=48691.che-hydra.bham.ac.uk 48693.che-hydra.bham.ac.uk 48738.che-hydra.bham.ac.uk,varattr=,rectime=1318502257 [root at che-hydra]# qmgr -c "p s" # # Create queues and set their attributes. 
# # # Create and define queue default # create queue default set queue default queue_type = Execution set queue default Priority = 100 set queue default resources_default.nodes = 1 set queue default enabled = True set queue default started = True # # Create and define queue secondgen # create queue secondgen set queue secondgen queue_type = Execution set queue secondgen Priority = 100 set queue secondgen acl_host_enable = False set queue secondgen acl_hosts = che-hydra+localhost set queue secondgen resources_default.neednodes = secondgennodes set queue secondgen resources_default.nodes = 1 set queue secondgen enabled = True set queue secondgen started = True # # Create and define queue thirdgen # create queue thirdgen set queue thirdgen queue_type = Execution set queue thirdgen Priority = 100 set queue thirdgen acl_host_enable = False set queue thirdgen acl_hosts = che-hydra+localhost set queue thirdgen resources_default.neednodes = thirdgennodes set queue thirdgen resources_default.nodes = 1 set queue thirdgen enabled = True set queue thirdgen started = True # # Create and define queue firstgen # create queue firstgen set queue firstgen queue_type = Execution set queue firstgen Priority = 100 set queue firstgen acl_host_enable = False set queue firstgen acl_hosts = che-hydra+localhost set queue firstgen resources_default.neednodes = firstgennodes set queue firstgen resources_default.nodes = 1 set queue firstgen enabled = True set queue firstgen started = True # # Set server attributes. 
# set server scheduling = True set server acl_host_enable = False set server acl_hosts = che-hydra.bham.ac.uk set server default_queue = default set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server auto_node_np = True set server next_job_number = 49702 Jonathan Smale From brianm at usc.edu Thu Oct 13 09:34:39 2011 From: brianm at usc.edu (Brian Mendenhall) Date: Thu, 13 Oct 2011 08:34:39 -0700 Subject: [torqueusers] [Mauiusers] Assigning Nodes to specific nodes In-Reply-To: References: Message-ID: <1318520079.15612.412.camel@orion.usc.edu> On Thu, 2011-10-13 at 11:08 +0000, Jonathan Smale wrote: > > set queue firstgen resources_default.neednodes = 1stgennodes > set nodes compute-0-0 properties = 1stgennodes > Not sure if you were just trying to be brief or if it was simply a typo, but here you have said you set the node compute-0-0's properties to '1stgennodes', and later you have it as 'firstgennodes'. If there is a difference, then that could be the problem. > I've found a fair few previous emails about this and have followed their solution without sucess. I can submit the jobs to the queues but they remain in a queued state, qstat -f gives:- > > [root at che-hydra /]# qstat -f 48691 > Job Id: 48691.che-hydra.bham.ac.uk > Job_Name = allenega_p2000_g300_r2.cmd > Job_Owner = jsmale at che-hydra.bham.ac.uk > resources_used.cput = 529:06:32 > resources_used.mem = 16872kb > resources_used.vmem = 245908kb > resources_used.walltime = 529:11:31 > job_state = R I do believe that 'job_state = R' means it is running from torque's perspective... Someone correct me if I'm wrong, please. > queue = default Please note the queue it is in, default, *not* firstgen, this is due to the definition of the default queue, below I'll comment more ... 
> > checking job 48691 > > State: Running > Creds: user:jsmale group:worth class:default qos:DEFAULT According to Maui, the job is running as well, in class 'default' > [root at che-hydra]# qmgr -c "p s" > # > # Create queues and set their attributes. > # > # > # Create and define queue default > # > create queue default > set queue default queue_type = Execution > set queue default Priority = 100 > set queue default resources_default.nodes = 1 > set queue default enabled = True > set queue default started = True So, I think that here is the root of your problem. You have configured the first queue that a job will be set to (in the server config below) is the 'default' queue, but this queue is an execution queue which means that jobs are going to 'run' in this queue. What I believe you want this queue to be is a routing queue, with route_destinations of the queues you want your jobs to run in. For example, my default queue configuration is thus: create queue default set queue default queue_type = Route set queue default max_queuable = 2000 set queue default resources_default.nodes = 1 set queue default resources_default.walltime = 00:30:00 set queue default route_destinations = quick set queue default route_destinations += long set queue default route_destinations += large_long set queue default route_destinations += large_route set queue default route_destinations += main_route set queue default route_waiting_jobs = True set queue default route_lifetime = 600 set queue default enabled = True set queue default started = True Please take note that torque will run the job in the FIRST place that it *can* run in (based on neednodes, resources_min, resources_max), so you should also order your queues appropriately. > # > # Set server attributes. > # > set server scheduling = True > set server acl_host_enable = False > set server acl_hosts = che-hydra.bham.ac.uk > set server default_queue = default ^^^ this is the important line that sets the first queue destination. 
All jobs that do not specify a queue name will go through this named queue.

I hope this helps,

--
Brian Mendenhall
Linux/HPCC Administrator
University of Southern California

From halmabrazi at idtdna.com Thu Oct 13 13:34:22 2011
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Thu, 13 Oct 2011 19:34:22 +0000
Subject: [torqueusers] Node is not responding!
Message-ID:

Dear All,

First, I am a newbie to Torque and this is my first message to this group. I hope I will not waste anyone's time by asking such a stupid question. I have tried to look for answers in the archived listinfo, but since there are no search capabilities built in, I find it hard to find what I need.

Here is where I am so far:

I installed the Torque 3.0 package on my Linux box (SUSE 11.2). I also configured a node on a different VM that is running SUSE as well. It seems things are installed and configured correctly (I think).

When I run pbsnodes I get

suse-ptpd-16
     state = free
     np = 1
     ntype = cluster
     status = rectime=1318533469,varattr=,jobs=,state=free,netload=116214003,gres=,loadave=0.00,ncpus=1,physmem=1017908kb,availmem=3012532kb,totmem=3115056kb,idletime=76,nusers=2,nsessions=7,sessions=1753 1767 1770 1889 1894 1997 3017,uname=Linux suse-ptpd-16 2.6.34-12-desktop #1 SMP PREEMPT 2010-06-29 02:39:08 +0200 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

When I shut down the node, its state changes to "down". This tells me everything is okay.

Then I tried to send my first job to the node. I used this example found online:

>test.job

#!/bin/bash
# --- send the output to the test.out file
# the default is .o
#PBS -o test.out
# --- send the error output to the test.err file
# the default is .e
#PBS -e test.err

echo "Print out the hostname and date"
/bin/hostname
/bin/date
exit 0

And then I ran it from the head node (not as root):

>qsub test.job

Looking at the submitted jobs (I submitted the job twice):

>qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
16.suse-halmabr           test.job         torqueuser             0 Q batch
17.suse-halmabr           test.job         torqueuser             0 Q batch

However, nothing seems to be happening after that.

Can anybody tell me what I am doing wrong or if I am missing something here? Also, if someone can direct me to the right site for examples on how to use the server, that will be highly appreciated.

Regards,

~Hak

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111013/802ffcb0/attachment.html

From scrusan at ur.rochester.edu Thu Oct 13 16:47:28 2011
From: scrusan at ur.rochester.edu (Steve Crusan)
Date: Thu, 13 Oct 2011 18:47:28 -0400
Subject: [torqueusers] Node is not responding!
In-Reply-To:
References:
Message-ID: <114AA362-3E4D-4A32-94DF-6ADCE839FBAC@ur.rochester.edu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Oct 13, 2011, at 3:34 PM, Hakeem Almabrazi wrote:

> Dear All,
>
> First, I am newbie to Torque and this is my first message to this group. I hope I will not waste anyone's time by asking such stupid question but I have tried to look for some answers in the archived listinfo but since there is no search capabilities built in I find it harder to find what I need.
>
> Here is where I am so far:
>
> I installed the Torque 3.0 package on my Linux box (SUSE 11.2). I also configured a node on a different VM that is running SUSE as well.
It seems things are installed and configured correctly (I think). > > When I run the pbsnodes I get > > suse-ptpd-16 > state = free > np = 1 > ntype = cluster > status = rectime=1318533469,varattr=,jobs=,state=free,netload=116214003,gres=,loadave=0.00,ncpus=1,physmem=1017908kb,availmem=3012532kb,totmem=3115056kb,idletime=76,nusers=2,nsessions=7,sessions=1753 1767 1770 1889 1894 1997 3017,uname=Linux suse-ptpd-16 2.6.34-12-desktop #1 SMP PREEMPT 2010-06-29 02:39:08 +0200 i686,opsys=linux > mom_service_port = 15002 > mom_manager_port = 15003 > gpus = 0 > > When I shut down the node it changes to "down" in the state. This tells me everything is okay. > > However, when I tried to send my first job to the node. I used this example found online > >> test.job > > #!/bin/bash > # --- send the output to the test.out file > # the default is .o > #PBS -o test.out > # --- send the error output to the test.err file > # the default is .e > #PBS -e test.err > > echo "Print out the hostname and date" > /bin/hostname > /bin/date > exit 0 > > And then I ran it from the head node (not as a root) > >> qsub test.job > > Looking at the submitted jobs ( I submitted the jobs twice) > >> qstat > Job id Name User Time Use S Queue > ------------------------- ---------------- --------------- -------- - ----- > 16.suse-halmabr test.job torqueuser 0 Q batch > 17.suse-halmabr test.job torqueuser 0 Q batch > > > However, nothing seems to be happening after that. > > Can any body tell me what I am doing wrong or if I am missing something here? Also, it will be great if someone can direct me to the right site for examples on how to use the server that will be highly appreciated. What scheduler are you using above TORQUE ( PBS_SCHED, MAUI, etc)? If there is nothing that tells TORQUE to run the job, it won't run (unless you force it to run with qrun ). 
If everything is setup right with a scheduler and such, you should be able to submit a job interactively to run some basic tests, a la: qsub -I -q batch ,etc,etc... If you are using Maui/Moab as a scheduler, run a checkjob -v which will give you some information on why the scheduler isn't starting the job. Here is a link to integrating TORQUE+MAUI [open source]: http://www.adaptivecomputing.com/resources/docs/maui/pbsintegration.php You should be able to use the base Maui setup with TORQUE. Hope that helps. ~Steve > > Regards, > > ~Hak > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOl2qHAAoJENS19LGOpgqK/CUH/0xmHHHhXYq0AHpM7FqP7d5L xQRtNpbxlVejU68XFfxM7lHA/pp6lzb/niFBZG4ujHMofv84qKnAq7vFBskgIhKm 9AtTU+W3PkkjdHWS7lhYkSt0Wun+k0te5TMu/QfBfKpLTioEU0SqlHG+RhGgaWcn ow5/+TXN/tQCTayaZko7m/VbOcFv258B1lqEQFwczf7KgUcUDKsYc27lZlxG3IGj 20CGHGytSueGwIv1YdD9QRo7zFXMy49keN7z8nDaasDuBCMtsBpJrX2Xt77zvfcY ncyWknmd9+hgiAenLptFjm3T6Mt7Q+BDEWCT58buUQtgODHMiRMIHEwhTW1WE3w= =uEFl -----END PGP SIGNATURE----- From halmabrazi at idtdna.com Fri Oct 14 10:38:32 2011 From: halmabrazi at idtdna.com (Hakeem Almabrazi) Date: Fri, 14 Oct 2011 16:38:32 +0000 Subject: [torqueusers] Node is not responding! In-Reply-To: <114AA362-3E4D-4A32-94DF-6ADCE839FBAC@ur.rochester.edu> References: <114AA362-3E4D-4A32-94DF-6ADCE839FBAC@ur.rochester.edu> Message-ID: Steve, Thank you for your tip. That was the cause of my issue. I had to start the server_sched daemon. I started looking at the Maui scheduler. It looks like another big learning curve I need to do. 
Again, I appreciate your help,

~Hak

From halmabrazi at idtdna.com Fri Oct 14 15:36:06 2011
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Fri, 14 Oct 2011 21:36:06 +0000
Subject: [torqueusers] Node is not responding!
In-Reply-To: <114AA362-3E4D-4A32-94DF-6ADCE839FBAC@ur.rochester.edu>
References: <114AA362-3E4D-4A32-94DF-6ADCE839FBAC@ur.rochester.edu>
Message-ID:

Steve,

Another easy question: where should the output of my script be stored? For example, for the script below, I could not find the test.o or test.err files on the node I submitted the job from.

Thanks
Hak

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Steve Crusan
Sent: Thursday, October 13, 2011 5:47 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Node is not responding!
>> test.job
>
> #!/bin/bash
> # --- send the output to the test.out file
> # the default is .o
> #PBS -o test.out
> # --- send the error output to the test.err file
> # the default is .e
> #PBS -e test.err
>
> echo "Print out the hostname and date"
> /bin/hostname
> /bin/date
> exit 0

From jjc at iastate.edu Fri Oct 14 15:44:43 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 14 Oct 2011 16:44:43 -0500
Subject: [torqueusers] Policies for scheduling with unusual/reserved-use nodes.
Message-ID:

All,

I'm running Torque 2.5.4 on a homogeneous cluster with only Opteron CPUs, and someone wants to add similar machines with two GPU cards in them.

I am unsure whether this person will want these machines held exclusively for his group's use, and whether/how Torque can accommodate this.

How have others handled the technical end of this?

I know that users can easily specify
  -l nodes=x:ppn=y:gpus=z
to get on specifically those nodes, and I can set a property for specific nodes, e.g. nogpu, so that a job can specify
  -l nodes=x:ppn=y:nogpu
and specifically avoid those nodes (and maybe put that in a job wrapper script.)
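The wrapper idea above could be sketched roughly as follows. This is only an illustration, not an existing Torque tool: the function name add_nogpu is made up, and it assumes the non-GPU nodes have been tagged with a nogpu property in the server's nodes file. It appends :nogpu to each chunk of a nodes specification unless that chunk asks for GPUs or already carries the property.

```shell
#!/bin/bash
# Hypothetical helper (not part of Torque): add the "nogpu" node property to
# every chunk of a "-l nodes=..." specification that does not request GPUs.
# Assumes the non-GPU nodes carry a "nogpu" property in the nodes file.
add_nogpu() {
    local spec=$1 chunk out="" chunks
    # Chunks of a nodes specification are joined with '+', e.g. 2:ppn=4+1:ppn=8.
    IFS='+' read -r -a chunks <<< "$spec"
    for chunk in "${chunks[@]}"; do
        case "$chunk" in
            *gpus=*|*:nogpu|*:nogpu:*) ;;   # GPU request or already tagged: leave as is
            *) chunk="$chunk:nogpu" ;;      # plain CPU chunk: steer it to non-GPU nodes
        esac
        out="${out:+$out+}$chunk"           # re-join the chunks with '+'
    done
    printf '%s\n' "$out"
}

# Example use inside a wrapper around qsub (job.sh is a placeholder):
#   qsub -l nodes="$(add_nogpu "2:ppn=4")" job.sh
```

A wrapper like this only helps cooperative users. It does nothing to stop someone from calling qsub directly and leaving the property off.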
However, what do you do if the new group wants "nobody else can run on my nodes" in an environment where users could specify gpus=z even when they did not need it, or just leave off the nogpu so that they get scheduled on any available nodes?

We've never had this, as all the machines were the same, but I may need to implement it, likely on short notice, so I want to be ahead of the curve.

Can such a policy be implemented, preferably via

1) pbs_sched or

2) Maui?

I can probably hack something together myself, but I'd guess that others must have crossed this bridge, and I'd like to learn from those with this experience.

Thanks,

- Jim

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.

From scrusan at ur.rochester.edu Fri Oct 14 16:18:53 2011
From: scrusan at ur.rochester.edu (Steve Crusan)
Date: Fri, 14 Oct 2011 18:18:53 -0400
Subject: [torqueusers] Policies for scheduling with unusual/reserved-use nodes.
In-Reply-To:
References:
Message-ID: <8C049387-9FDE-48CB-85F0-E9999461EDCE@ur.rochester.edu>

On Oct 14, 2011, at 5:44 PM, Coyle, James J [ITACD] wrote:

> However, what do you do if the new group wants "nobody else can run on my nodes"
> in an environment where users could specify gpus=z even when they did not need it,
> or just leave off the nogpu so that they get scheduled on any available nodes.

If I'm not mistaken, you just want to protect privately purchased resources from being used by members of the shared resources? Meaning, if you don't pay, you don't get to play on those nodes?

If that's the case, you can simply do this with TORQUE. If all of your non-privately funded nodes have one feature in common, you could set your 'standard' queue with a resources_default.neednodes = featurename. For the private nodes, create a queue that has a neednodes feature that ONLY the privately funded hardware has, and then set a group ACL on that queue.

So basically standard nodes require a common feature anyone can use, but private nodes require a feature that only certain groups can use, based on TORQUE acl_groups = groupname statements.

If you're interested in using Maui to achieve a setup like this, you can map classes/qos -> a standing reservation that contains those GPU nodes.

Basically: (this can also be done with CLASS=)

CLASSCFG[econ] QLIST=econqos QDEF=econqos
SRCFG[privhwres] OWNER=QOS:econqos
SRCFG[privhwres] QOSLIST=econqos
SRCFG[privhwres] HOSTLIST=node0[1-4]

You can just create a standing reservation with access restrictions, whether it be CLASS, QOS, USER, ACCOUNT, etc.

What we do is map a class (the class has TORQUE ACL constraints) -> qos -> standing reservation. We haven't had any problems.

Let me know if this helps.

~Steve

----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/

From gus at ldeo.columbia.edu Sat Oct 15 08:50:32 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sat, 15 Oct 2011 10:50:32 -0400
Subject: [torqueusers] Node is not responding!
In-Reply-To:
References: <114AA362-3E4D-4A32-94DF-6ADCE839FBAC@ur.rochester.edu>
Message-ID: <71BB9A0F-59E3-41B3-89C0-E34CC6E6284E@ldeo.columbia.edu>

Hi Hakeem

Look for them in ${TORQUE}/spool or ${TORQUE}/undelivered on the 'mother superior' node, i.e., the first node Torque gave to your job, wherever you installed ${TORQUE}. This is for jobs that fail ungracefully. You can find out which node it was with 'qstat -n' or by looking up the job number in the Torque pbs_server logs.

My two cents,
Gus Correa

On Oct 14, 2011, at 5:36 PM, Hakeem Almabrazi wrote:

> Steve,
>
> Another easy question: where should the output of my script be stored?
> For example, for the script below, I could not find the test.o or test.err files on the node I submitted the job from.
>
> Thanks
> Hak

From nicolas.ferre at univ-provence.fr Sun Oct 16 01:52:34 2011
From: nicolas.ferre at univ-provence.fr (Nicolas Ferré)
Date: Sun, 16 Oct 2011 09:52:34 +0200
Subject: [torqueusers] "Cloned" job
Message-ID:

Dear Torque users,

For some days now, a weird thing has been happening quite often on our cluster, which runs Torque + Maui. Some single-node jobs are launched two or three times, i.e. "clones" are running simultaneously on different nodes.

Example: job 61213

> qstat -n1|grep 61213
61213.slater.up. fredbip7 journey opt_part_anionMQ 629 1 4 4gb 48:00 R 44:18 lame10/3+lame10/2+lame10/1+lame10/0

> pbsnodes -a|grep 61213 (or better pestat |grep 61213)
lame7 busy* 11* 24103 8 32104 1285 4/2 8 0 61215 fredbip7 61212 fredbip7 61213 fredbip7
lame9 free 7* 24103 8 32104 1190 3/2 4 0 61214 fredbip7 61213 fredbip7
lame10 busy* 11* 24103 8 32104 1318 4/3 8 0 61215 fredbip7 61213 fredbip7 61289 stereo

> checkjob 61213
checking job 61213

State: Running
Creds: user:fredbip7 group:fredbip7 class:journey qos:DEFAULT
WallTime: 1:20:22:15 of 2:00:00:00
SubmitTime: Fri Oct 14 13:23:40
  (Time Queued Total: 00:02:21 Eligible: 00:01:04)

StartTime: Fri Oct 14 13:26:01
Total Tasks: 4

Req[0] TaskCount: 4 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 1024M
Allocated Nodes:
[lame10:4]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 3
PartitionMask: [ALL]
Flags: RESTARTABLE

Reservation '61213' (-1:20:21:46 -> 3:38:14 Duration: 2:00:00:00)
Messages: cannot start job - RM failure, rc: 15084, msg: 'Premature end of message'
PE: 4.00 StartPriority: 1

I can see the strange message "Premature end of message", but according to the Torque documentation it does not seem to be related to the problem.

Any idea?

Nicolas Ferré
Laboratoire Chimie Provence
Aix-Marseille Université, France
Phone : +33 413550532

From fabien.archambault at univ-provence.fr Fri Oct 14 07:16:26 2011
From: fabien.archambault at univ-provence.fr (Fabien Archambault)
Date: Fri, 14 Oct 2011 15:16:26 +0200
Subject: [torqueusers] Job starting on two nodes only one node asked and seen in qstat
Message-ID: <4E98362A.3080901@univ-provence.fr>

Dear list,

I have an issue on a cluster I administer. It does not always appear, but when it does it affects many jobs.

The problem is that a job is submitted and, from qstat's point of view, runs only on lame6 (in the example I give), but the job is also started on lame5.
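One hedged way to narrow down which jobs may have been started more than once is to scan `qstat -f` output for a start_count above 1. The sketch below assumes the `Job Id:` and `start_count = N` lines look like the ones in the `qstat -f` output posted in this thread; the function name list_restarted_jobs is made up. A start_count above 1 does not prove that a clone is still running (an ordinary requeue also raises it), so each hit still has to be checked on the nodes themselves.

```shell
#!/bin/bash
# Hypothetical helper: read `qstat -f` output on stdin and print
# "<job id> <start_count>" for every job that was started more than once.
list_restarted_jobs() {
    awk '
        /^Job Id:/ { id = $3 }                               # remember the current job
        $1 == "start_count" && $3 + 0 > 1 { print id, $3 }   # started more than once
    '
}

# Example: qstat -f | list_restarted_jobs
```

Jobs reported this way are only candidates; comparing exec_host from qstat with what the nodes report is still needed to confirm a duplicate.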
Here is a recent example I had and some output. If you need more please do not hesitate to ask.

Maui version 3.3.1

# /usr/local/sbin/pbs_server -v
version: 2.5.7

The job is:

# qstat -f 61209
Job Id: 61209.master.up.univ-mrs.fr
    Job_Name = Ru-41-b3lypMKvdw
    Job_Owner = biosci at master.up.univ-mrs.fr
    resources_used.cput = 06:44:35
    resources_used.mem = 206392kb
    resources_used.vmem = 6091668kb
    resources_used.walltime = 01:43:10
    job_state = R
    queue = journey
    server = master.up.univ-mrs.fr
    Checkpoint = u
    ctime = Fri Oct 14 13:20:46 2011
    Error_Path = master.up.univ-mrs.fr:/home/biosci/Michel/laccase/Ru-41-b3lyp
        MKvdw.e61209
    exec_host = lame6/3+lame6/2+lame6/1+lame6/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Oct 14 13:22:08 2011
    Output_Path = master.up.univ-mrs.fr:/home/biosci/Michel/laccase/Ru-41-b3ly
        pMKvdw.o61209
    Priority = 0
    qtime = Fri Oct 14 13:20:46 2011
    Rerunable = True
    Resource_List.mem = 4gb
    Resource_List.neednodes = 1:ppn=4
    Resource_List.nice = 10
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=4
    Resource_List.other = disk=200gb
    Resource_List.walltime = 48:00:00
    session_id = 20049
    Shell_Path_List = /bin/sh
    substate = 42
    Variable_List = PBS_O_QUEUE=defaut,PBS_O_HOME=/home/biosci,
        PBS_O_LANG=fr_FR.UTF-8,PBS_O_LOGNAME=biosci,
        PBS_O_PATH=/home/biosci/COSMOlogic10/TURBOMOLE/bin/em64t-unknown-linu
        x-gnu:/home/biosci/COSMOlogic10/TURBOMOLE/scripts:/usr/local/maui/bin:
        /usr/local/Mercury_2.4/bin:/usr/local/Ampac-9.2/exec:/usr/local/Ampac-
        9.1/exec:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/biosci/b
        in,PBS_O_MAIL=/var/spool/mail/biosci,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=master.up.univ-mrs.fr,PBS_SERVER=master,
        PBS_O_WORKDIR=/home/biosci/Michel/laccase
    euser = biosci
    egroup = biosci
    hashname = 61209.master.up.univ-mrs.fr
    queue_rank = 7385
    queue_type = E
    etime = Fri Oct 14 13:20:46 2011
    submit_args = /home/biosci/Michel/laccase/Ru-41-b3lypMKvdw.job
    start_time = Fri Oct 14 13:21:17 2011
    Walltime.Remaining = 166545
    start_count = 2
    fault_tolerant = False
    submit_host = master.up.univ-mrs.fr
    init_work_dir = /home/biosci/Michel/laccase

# diagnose -j 61209
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
61209 Running DEF 4 DEF 2:00:00:00 1 4 biosci biosci - 00:15:20 [NONE] [NONE] [NONE] >=0 >=0 NC0 [journey:1] [NONE]
WARNING: job '61209' utilizes more procs than dedicated (3.92 > 1)

#pestat
...
lame5 busy* 11* 24103 8 32104 1333 4/3 8 0 61208 biosci 61209 biosci 61210 fredbip7
lame6 excl 8 24103 8 32104 1193 3/3 8 0 61209 biosci 61211 fredbip7
...

On the master:
# tracejob -q 61209
Job: 61209.master.up.univ-mrs.fr
10/14/2011 13:20:46 S enqueuing into defaut, state 1 hop 1
10/14/2011 13:20:46 S dequeuing from defaut, state QUEUED
10/14/2011 13:20:46 S enqueuing into journey, state 1 hop 1
10/14/2011 13:20:46 S Job Queued at request of biosci at master.up.univ-mrs.fr, owner = biosci at master.up.univ-mrs.fr, job name = Ru-41-b3lypMKvdw, queue = journey
10/14/2011 13:20:46 A queue=defaut
10/14/2011 13:20:46 A queue=journey
10/14/2011 13:20:57 S Job Run at request of root at master.up.univ-mrs.fr
10/14/2011 13:21:07 S unable to run job, MOM rejected/timeout
10/14/2011 13:21:17 S Job Run at request of root at master.up.univ-mrs.fr
10/14/2011 13:21:22 A user=biosci group=biosci jobname=Ru-41-b3lypMKvdw queue=journey ctime=1318591246 qtime=1318591246 etime=1318591246 start=1318591282 owner=biosci at master.up.univ-mrs.fr exec_host=lame6/3+lame6/2+lame6/1+lame6/0 Resource_List.mem=4gb Resource_List.neednodes=1:ppn=4 Resource_List.nice=10 Resource_List.nodect=1 Resource_List.nodes=1:ppn=4 Resource_List.other=disk=200gb Resource_List.walltime=48:00:00
10/14/2011 13:22:08 S Not sending email: User does not want mail of this type.
10/14/2011 14:51:48 S enqueuing into journey, state 4 hop 1
10/14/2011 14:51:48 S Requeueing job, substate: 42 Requeued in queue: journey

On lame5:
# tracejob -q 61209
Job: 61209.master.up.univ-mrs.fr
10/14/2011 13:21:02 M evaluating limits for job
10/14/2011 13:21:02 M about to fork child which will become job
10/14/2011 13:21:02 M phase 2 of job launch successfully completed
10/14/2011 13:21:07 M job not ready after 5 second timeout, MOM will recheck
10/14/2011 13:21:08 M checking job start in TMOMScanForStarting - examining pipe from child
10/14/2011 13:21:09 M job 61209.master.up.univ-mrs.fr child not started, will check later
10/14/2011 13:21:40 M task/session info loaded
10/14/2011 13:21:40 M job 61209.master.up.univ-mrs.fr reported successful start

On lame6:
# tracejob -q 61209
Job: 61209.master.up.univ-mrs.fr
10/14/2011 13:21:17 M evaluating limits for job
10/14/2011 13:21:17 M about to fork child which will become job
10/14/2011 13:21:17 M phase 2 of job launch successfully completed
10/14/2011 13:21:22 M job not ready after 5 second timeout, MOM will recheck
10/14/2011 13:21:22 M checking job start in TMOMScanForStarting - examining pipe from child
10/14/2011 13:21:23 M job 61209.master.up.univ-mrs.fr child not started, will check later
10/14/2011 13:21:55 M task/session info loaded
10/14/2011 13:21:55 M job 61209.master.up.univ-mrs.fr reported successful start
10/14/2011 13:22:08 M encoding "send flagged" attr: session_id

Thanks for your help,
Fabien

From zhaohscas at yahoo.com.cn Mon Oct 17 22:38:17 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Tue, 18 Oct 2011 12:38:17 +0800
Subject: [torqueusers] An issue when using pbs script to invoke different cpus from different nodes.
Message-ID: <4E9D02B9.5000105@yahoo.com.cn>

Hi all,

I use qsub to submit the job to my queue.
Currently I've the following lines in the pbs script invoked by qsub: -------- #PBS -l nodes=2:ppn=8 #PBS -l walltime=99:00:00 #PBS -j oe #PBS -o out #PBS -e err #PBS -V #PBS -q default ---------- As you can see, in the above example, the job will use 16 cpus equally supplied by two nodes. But now, I want to let pbs assign the cpus and nodes to this job according to the following requirements: 1- There are 8 cpus used for this job. 2- All of the these cpus may belong to one node, or can come from different nodes, say, supplied by two/three/four nodes and so on. Could you please give me some hints on this issue. Thanks in advance. Regards -- Hongsheng Zhao School of Physics and Electrical Information Science, Ningxia University, Yinchuan 750021, China From knielson at adaptivecomputing.com Tue Oct 18 08:19:17 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 18 Oct 2011 08:19:17 -0600 (MDT) Subject: [torqueusers] An issue when using pbs script to invoke different cpus from different nodes. In-Reply-To: <4E9D02B9.5000105@yahoo.com.cn> Message-ID: ----- Original Message ----- > From: "Hongsheng Zhao" > To: torqueusers at supercluster.org > Sent: Monday, October 17, 2011 10:38:17 PM > Subject: [torqueusers] An issue when using pbs script to invoke different cpus from different nodes. > > Hi all, > > I use qsub to submit the job to my queue. Currently I've the > following > lines in the pbs script invoked by qsub: > > -------- > #PBS -l nodes=2:ppn=8 > #PBS -l walltime=99:00:00 > #PBS -j oe > #PBS -o out > #PBS -e err > #PBS -V > #PBS -q default > ---------- > > As you can see, in the above example, the job will use 16 cpus > equally > supplied by two nodes. But now, I want to let pbs assign the cpus > and > nodes to this job according to the following requirements: > > 1- There are 8 cpus used for this job. > 2- All of the these cpus may belong to one node, or can come from > different nodes, say, supplied by two/three/four nodes and so on. 
> > Could you please give me some hints on this issue. Thanks in > advance. > If you do not care what nodes the processors come from you could use -l procs=8. The procs option tells the scheduler to assign 8 processors from where ever it can find them. Ken From sefa at ulakbim.gov.tr Tue Oct 18 08:29:13 2011 From: sefa at ulakbim.gov.tr (Sefa Arslan) Date: Tue, 18 Oct 2011 17:29:13 +0300 Subject: [torqueusers] An issue when using pbs script to invoke different cpus from different nodes. In-Reply-To: <4E9D02B9.5000105@yahoo.com.cn> References: <4E9D02B9.5000105@yahoo.com.cn> Message-ID: <4E9D8D39.3000700@ulakbim.gov.tr> just use nodes instead of nodes+ppn #PBS -l nodes=8 #PBS -l walltime=99:00:00 #PBS -j oe #PBS -o out #PBS -e err #PBS -V #PBS -q default In this case, the cpus are supplied from different nodes (not have to be supplied from 8 different nodes, number of nodes depends on the availability of the cluster) Sefa Arslan On 10/18/2011 07:38 AM, Hongsheng Zhao wrote: > Hi all, > > I use qsub to submit the job to my queue. Currently I've the following > lines in the pbs script invoked by qsub: > > -------- > #PBS -l nodes=2:ppn=8 > #PBS -l walltime=99:00:00 > #PBS -j oe > #PBS -o out > #PBS -e err > #PBS -V > #PBS -q default > ---------- > > As you can see, in the above example, the job will use 16 cpus equally > supplied by two nodes. But now, I want to let pbs assign the cpus and > nodes to this job according to the following requirements: > > 1- There are 8 cpus used for this job. > 2- All of the these cpus may belong to one node, or can come from > different nodes, say, supplied by two/three/four nodes and so on. > > Could you please give me some hints on this issue. Thanks in advance. > > Regards > From kenneth at sdsc.edu Tue Oct 18 11:18:17 2011 From: kenneth at sdsc.edu (Kenneth Yoshimoto) Date: Tue, 18 Oct 2011 10:18:17 -0700 (PDT) Subject: [torqueusers] cpusets memory nodes? 
Message-ID: Hi, Does anyone know if Torque cpusets assigns memory nodes to jobs as well as cpus? Or is the NUMA capability the right thing to use here for a cluster with multiple NUMA nodes? Thanks, Kenneth From tbaer at utk.edu Tue Oct 18 12:39:55 2011 From: tbaer at utk.edu (Troy Baer) Date: Tue, 18 Oct 2011 14:39:55 -0400 Subject: [torqueusers] cpusets memory nodes? In-Reply-To: References: Message-ID: <1318963195.27239.7.camel@ashitaka.baer.lan> On Tue, 2011-10-18 at 10:18 -0700, Kenneth Yoshimoto wrote: > Does anyone know if Torque cpusets assigns memory nodes > to jobs as well as cpus? Or is the NUMA capability the > right thing to use here for a cluster with multiple > NUMA nodes? AFAIK, TORQUE doesn't do anything with the mems field in a cpuset unless you're using a NUMA build of 3.0.x or later. I believe your scheduler also needs to be aware of such things, and the only one I know of that is currently thus aware is Moab. --Troy From kenneth at sdsc.edu Tue Oct 18 13:17:52 2011 From: kenneth at sdsc.edu (Kenneth Yoshimoto) Date: Tue, 18 Oct 2011 12:17:52 -0700 (PDT) Subject: [torqueusers] cpusets memory nodes? In-Reply-To: <1318963195.27239.7.camel@ashitaka.baer.lan> References: <1318963195.27239.7.camel@ashitaka.baer.lan> Message-ID: Troy, Do you know if I can have multiple NUMA nodes configured in Torque, or is it only taking a single MOM in that case? Kenneth On Tue, 18 Oct 2011, Troy Baer wrote: > Date: Tue, 18 Oct 2011 14:39:55 -0400 > From: Troy Baer > Reply-To: Torque Users Mailing List > To: Kenneth Yoshimoto , > Torque Users Mailing List > Subject: Re: [torqueusers] cpusets memory nodes? > > On Tue, 2011-10-18 at 10:18 -0700, Kenneth Yoshimoto wrote: >> Does anyone know if Torque cpusets assigns memory nodes >> to jobs as well as cpus? Or is the NUMA capability the >> right thing to use here for a cluster with multiple >> NUMA nodes? 
> > AFAIK, TORQUE doesn't do anything with the mems field in a cpuset unless > you're using a NUMA build of 3.0.x or later. I believe your scheduler > also needs to be aware of such things, and the only one I know of that > is currently thus aware is Moab. > > --Troy > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From tbaer at utk.edu Tue Oct 18 13:28:13 2011 From: tbaer at utk.edu (Troy Baer) Date: Tue, 18 Oct 2011 15:28:13 -0400 Subject: [torqueusers] cpusets memory nodes? In-Reply-To: References: <1318963195.27239.7.camel@ashitaka.baer.lan> Message-ID: <1318966093.27239.26.camel@ashitaka.baer.lan> On Tue, 2011-10-18 at 12:17 -0700, Kenneth Yoshimoto wrote: > Do you know if I can have multiple NUMA nodes configured > in Torque, or is it only taking a single MOM in that case? Each host will run a single pbs_mom daemon that will report itself as multiple virtual nodes, one per NUMA node. At one point, I had a test system configured with three hosts, each of which had 4 NUMA nodes; when you ran pbsnodes -a on it, it looked like 12 nodes. The scheduler has to know how to glue these virtual nodes back together, which IIRC Moab does with its nodeset functionality. --Troy From kenneth at sdsc.edu Tue Oct 18 13:32:24 2011 From: kenneth at sdsc.edu (Kenneth Yoshimoto) Date: Tue, 18 Oct 2011 12:32:24 -0700 (PDT) Subject: [torqueusers] cpusets memory nodes? In-Reply-To: <1318966093.27239.26.camel@ashitaka.baer.lan> References: <1318963195.27239.7.camel@ashitaka.baer.lan> <1318966093.27239.26.camel@ashitaka.baer.lan> Message-ID: Hmmm, in that scenario, I would want it to look like 3 nodes, pbsnodes -a reporting 3 nodes, but with cpuset enforcement within each node. 
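For the three-host case described above, one way to see how the virtual nodes map back onto physical hosts is to group the node names that pbsnodes -a prints. The sketch below is only an illustration under stated assumptions: it assumes the NUMA virtual nodes are named <host>-<index> (for example hostA-0 through hostA-3), and count_virtual_nodes is a made-up helper name. It does not change what pbsnodes reports; it only summarizes it.

```shell
#!/bin/bash
# Count NUMA virtual nodes per physical host.
# Assumption: virtual nodes are named "<host>-<index>", e.g. "hostA-0".
# Hosts whose real names end in "-<digits>" would be misgrouped.
# Input (stdin): pbsnodes -a style text, where node names start in
# column one and attribute lines are indented.
count_virtual_nodes() {
    awk '/^[^ \t]/ { sub(/-[0-9]+$/, "", $1); n[$1]++ }
         END { for (h in n) print h, n[h] }' | sort
}

# Usage on a live system (needs the TORQUE client tools):
#   pbsnodes -a | count_virtual_nodes
```

With three hosts of four NUMA nodes each, this would print three lines with a count of 4, which is the per-host view asked for here, although the scheduler still sees twelve nodes.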
On Tue, 18 Oct 2011, Troy Baer wrote: > Date: Tue, 18 Oct 2011 15:28:13 -0400 > From: Troy Baer > To: Kenneth Yoshimoto , > Torque Users Mailing List > Subject: Re: [torqueusers] cpusets memory nodes? > > On Tue, 2011-10-18 at 12:17 -0700, Kenneth Yoshimoto wrote: >> Do you know if I can have multiple NUMA nodes configured >> in Torque, or is it only taking a single MOM in that case? > > Each host will run a single pbs_mom daemon that will report itself as > multiple virtual nodes, one per NUMA node. At one point, I had a test > system configured with three hosts, each of which had 4 NUMA nodes; when > you ran pbsnodes -a on it, it looked like 12 nodes. The scheduler has > to know how to glue these virtual nodes back together, which IIRC Moab > does with its nodeset functionality. > > --Troy > > From zhaohscas at yahoo.com.cn Tue Oct 18 23:22:08 2011 From: zhaohscas at yahoo.com.cn (Hongsheng Zhao) Date: Wed, 19 Oct 2011 13:22:08 +0800 Subject: [torqueusers] Comment out the line #PBS -l nodes=1:ppn=8 in pbs script. Message-ID: <4E9E5E80.2050509@yahoo.com.cn> Hi all, I want to comment out the line #PBS -l nodes=1:ppn=8. Does the following line do the trick, ##PBS -l nodes=1:ppn=8 Regards -- Hongsheng Zhao School of Physics and Electrical Information Science, Ningxia University, Yinchuan 750021, China From sabujp at gmail.com Tue Oct 18 23:37:19 2011 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Wed, 19 Oct 2011 00:37:19 -0500 Subject: [torqueusers] Comment out the line #PBS -l nodes=1:ppn=8 in pbs script. In-Reply-To: <4E9E5E80.2050509@yahoo.com.cn> References: <4E9E5E80.2050509@yahoo.com.cn> Message-ID: yes On Wed, Oct 19, 2011 at 12:22 AM, Hongsheng Zhao wrote: > Hi all, > > I want to comment out the line #PBS -l nodes=1:ppn=8. 
Does the
> following line do the trick,
>
> ##PBS -l nodes=1:ppn=8
>
> Regards
> --
> Hongsheng Zhao
> School of Physics and Electrical Information Science,
> Ningxia University, Yinchuan 750021, China
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From zhaohscas at yahoo.com.cn Tue Oct 18 23:42:38 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Wed, 19 Oct 2011 13:42:38 +0800
Subject: [torqueusers] An issue when using pbs script to invoke different cpus from different nodes.
In-Reply-To:
References:
Message-ID: <4E9E634E.2000306@yahoo.com.cn>

On 10/18/2011 10:19 PM, Ken Nielson wrote:
>
>
> ----- Original Message -----
>> From: "Hongsheng Zhao"
>> To: torqueusers at supercluster.org
>> Sent: Monday, October 17, 2011 10:38:17 PM
>> Subject: [torqueusers] An issue when using pbs script to invoke different cpus from different nodes.
>>
>> Hi all,
>>
>> I use qsub to submit the job to my queue. Currently I've the
>> following
>> lines in the pbs script invoked by qsub:
>>
>> --------
>> #PBS -l nodes=2:ppn=8
>> #PBS -l walltime=99:00:00
>> #PBS -j oe
>> #PBS -o out
>> #PBS -e err
>> #PBS -V
>> #PBS -q default
>> ----------
>>
>> As you can see, in the above example, the job will use 16 cpus
>> equally
>> supplied by two nodes. But now, I want to let pbs assign the cpus
>> and
>> nodes to this job according to the following requirements:
>>
>> 1- There are 8 cpus used for this job.
>> 2- All of the these cpus may belong to one node, or can come from
>> different nodes, say, supplied by two/three/four nodes and so on.
>>
>> Could you please give me some hints on this issue. Thanks in
>> advance.
>>
>
> If you do not care what nodes the processors come from you could use -l procs=8.
>
> The procs option tells the scheduler to assign 8 processors from where ever it can find them.
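A quick way to check where a procs request actually landed is to summarize $PBS_NODEFILE from inside the job, since that file holds one line per allocated processor slot. What follows is a minimal sketch rather than anything from the TORQUE distribution; summarize_nodefile is a hypothetical helper name.

```shell
#!/bin/bash
# Print "<host> <slots>" for every host listed in a TORQUE node file.
# The node file has one line per allocated processor slot, so the
# per-host line count is the number of slots granted on that host.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job script:
#   summarize_nodefile "$PBS_NODEFILE"
#   echo "total slots: $(wc -l < "$PBS_NODEFILE")"
```

If the total is smaller than the number of processors requested, the request was not honoured the way the script expects, and an mpiexec -n computed from the node file will start correspondingly few ranks.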
According to your notes above, I've changed the pbs script snippet into
the following form:

------------
##PBS -l nodes=1:ppn=8
#PBS -l procs=35
#PBS -l walltime=99:00:00
#PBS -j oe
#PBS -o out
#PBS -e err
#PBS -V
#PBS -q default
-------------

In the default queue there are 11 nodes in total; one node has 16 cores
and all the other nodes have 8 cores.

Then I ran the vasp job via the above pbs script, and found the following
output at the very beginning of the OUTCAR file:

-----------
 vasp.4.6.35 3Apr08 complex
 executed on LinuxIFC date 2011.11.13 01:27:16
 running on 1 nodes
 distr: one band on 1 nodes, 1 groups
-----------

It looks like this job only uses one node. Could you please give me
some more hints? Thanks in advance.

For your information, I also give the complete content of my pbs script
below:

**************** pbs script beginning from here *********************
zhaohongsheng at node32:~/work/work3/afm> cat vasp.job
#!/bin/bash
#
##PBS -l nodes=1:ppn=8
#PBS -l procs=35
#PBS -l walltime=99:00:00
#PBS -j oe
#PBS -o out
#PBS -e err
#PBS -V
#PBS -q default

source /public/software/intel/Compiler/11.1/059/bin/intel64/ifortvars_intel64.sh
source /public/software/intel/mkl/bin/intel64/mklvars_intel64.sh
source /public/software/intel/mpi/intel64/bin/mpivars.sh

# go to work dir
cd $PBS_O_WORKDIR

# The program we want to execute (modify to suit your setup)
EXEC=/public/software/vasp4.6
#EXEC=/share/apps/vasp/bin/vasp52_mkl1023029_impi322006

# setup mpd env (Of course use some other secret word than "dfadfs")
if [ ! -f ~/.mpd.conf ]; then
    /bin/echo "secretword=dfadfs" >> ~/.mpd.conf
    /bin/chmod 600 ~/.mpd.conf
fi

##########################################################
# The following should be no need to
# change any of these settings for normal use.
##########################################################

# Intel MPI Home
MPI_HOME=/public/software/intel/mpi/intel64/bin

# setup Nums of Processor
NP=`cat $PBS_NODEFILE|wc -l`
echo "Numbers of Processors: $NP"
echo "---------------------------"

# Number of MPD
N_MPD=`cat $PBS_NODEFILE|uniq|wc -l`
echo "started mpd Number: $N_MPD"
echo "---------------------------"

# setup mpi env (em64t)
$MPI_HOME/mpdboot -r ssh -n $N_MPD -f $PBS_NODEFILE

# running program
$MPI_HOME/mpiexec -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE ssm -n $NP $EXEC

# clean
$MPI_HOME/mpdallexit
**************** pbs script ended here *********************

Regards.

>
> Ken
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From rcbord at vims.edu Wed Oct 19 02:34:01 2011
From: rcbord at vims.edu (Ralph C. Bording)
Date: Wed, 19 Oct 2011 08:34:01 +0000
Subject: [torqueusers] An issue when using pbs script to invoke different cpus from different nodes.
In-Reply-To: <4E9D02B9.5000105@yahoo.com.cn>
References: <4E9D02B9.5000105@yahoo.com.cn>
Message-ID: <21B0427E-CB1A-4A46-8D85-C667EBDF2E49@vims.edu>

Hey, you can check to see which nodes (resources) are being used in a
couple of different ways. Try "qstat -n" while the job is running.
You can also use "pbsnodes -a"; if you know the PBS jobid you can look
to see which nodes the job is running on that way too. To try that use

qstat -u username

That will get you the jobid.

pbsnodes -a | grep jobid

will get you something close to qstat -n.

Cheers
Chris

Sent from my iPhone

On Oct 18, 2011, at 7:12 AM, "Hongsheng Zhao" wrote:

> Hi all,
>
> I use qsub to submit the job to my queue.
Currently I've the following > lines in the pbs script invoked by qsub: > > -------- > #PBS -l nodes=2:ppn=8 > #PBS -l walltime=99:00:00 > #PBS -j oe > #PBS -o out > #PBS -e err > #PBS -V > #PBS -q default > ---------- > > As you can see, in the above example, the job will use 16 cpus equally > supplied by two nodes. But now, I want to let pbs assign the cpus and > nodes to this job according to the following requirements: > > 1- There are 8 cpus used for this job. > 2- All of the these cpus may belong to one node, or can come from > different nodes, say, supplied by two/three/four nodes and so on. > > Could you please give me some hints on this issue. Thanks in advance. > > Regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From lloyd_brown at byu.edu Wed Oct 19 10:27:09 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Wed, 19 Oct 2011 10:27:09 -0600 Subject: [torqueusers] 2.5.8 PAM module fails In-Reply-To: <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> References: <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> Message-ID: <4E9EFA5D.8000904@byu.edu> FWIW, we also saw the same symbol problem on our CentOS/NPACI-Rocks 5 cluster, with 2.5.8. I'm currently in the process of reverting our MOMs to 2.5.5, and will work on figuring out what went wrong, why it didn't show up in our testing, etc. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 10/06/2011 03:43 PM, Ken Nielson wrote: > I have tried this on CentOS 5 and Ubuntu 11.04 without any problem. I am building from the 2.5-fixes branch in Subversion. Is anyone else having an issue compiling PAM on TORQUE with 2.5.8? > > Maybe another good question is "is anyone using PAM with 2.5.8"? 
> From lloyd_brown at byu.edu Wed Oct 19 10:57:58 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Wed, 19 Oct 2011 10:57:58 -0600 Subject: [torqueusers] 2.5.8 PAM module fails In-Reply-To: <4E9EFA5D.8000904@byu.edu> References: <382b132e-738d-49d2-a0e6-e2ed286cf59a@mail> <4E9EFA5D.8000904@byu.edu> Message-ID: <4E9F0196.8010104@byu.edu> Okay. Apparently I just wasn't paying attention during testing to this little detail. It was happening. With v2.5.8, I get this in the log: > Oct 19 10:53:04 m5stage-1-4 sshd[5499]: PAM unable to dlopen(/lib64/security/pam_pbssimpleauth.so) > Oct 19 10:53:04 m5stage-1-4 sshd[5499]: PAM [error: /lib64/security/pam_pbssimpleauth.so: undefined symbol: getpwnam_ext] > Oct 19 10:53:04 m5stage-1-4 sshd[5499]: PAM adding faulty module: /lib64/security/pam_pbssimpleauth.so > Oct 19 10:53:07 m5stage-1-4 sshd[5499]: pam_access(sshd:account): access denied for user `lbrown' from `OBSCUREDHOSTNAME' > Oct 19 10:53:07 m5stage-1-4 sshd[5499]: Failed password for lbrown from OBSCUREDIPADDR port 48572 ssh2 > Oct 19 10:53:07 m5stage-1-4 sshd[5500]: fatal: Access denied for user lbrown by PAM account configuration But with v2.5.7, on the same host, with the same libraries, etc., and same user, I get this instead: > Oct 19 10:53:35 m5stage-1-4 pam_pbssimpleauth[8797]: opening /usr/spool/PBS/mom_priv/jobs > Oct 19 10:53:35 m5stage-1-4 pam_pbssimpleauth[8797]: username lbrown, known > Oct 19 10:53:35 m5stage-1-4 pam_pbssimpleauth[8797]: returning failed > Oct 19 10:53:35 m5stage-1-4 sshd[8797]: pam_access(sshd:account): access denied for user `lbrown' from `OBSCUREDHOSTNAME' > Oct 19 10:53:35 m5stage-1-4 sshd[8797]: Failed password for lbrown from OBSCUREDIPADDR port 48575 ssh2 > Oct 19 10:53:35 m5stage-1-4 sshd[8798]: fatal: Access denied for user lbrown by PAM account configuration Ken, you say that you have it working on a CentOS 5 instance with the released v2.5.8? 
Can you share the exact configure line you used, so we can confirm/deny this independently? FWIW, here's mine (I know at least one or two options are depricated): > CC=gcc CXX=g++ F77=gfortran ./configure --disable-server --enable-mom --enable-clients --disable-gui --with-server-home=/usr/spool/PBS --disable-sched --with-maildomain=users.fsl.byu.edu --with-pam --with-default-server=fslsched-stage.fsl.byu.edu --with-tcp-retry-limit=2 Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 10/19/2011 10:27 AM, Lloyd Brown wrote: > FWIW, we also saw the same symbol problem on our CentOS/NPACI-Rocks 5 > cluster, with 2.5.8. I'm currently in the process of reverting our MOMs > to 2.5.5, and will work on figuring out what went wrong, why it didn't > show up in our testing, etc. > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 10/06/2011 03:43 PM, Ken Nielson wrote: >> I have tried this on CentOS 5 and Ubuntu 11.04 without any problem. I am building from the 2.5-fixes branch in Subversion. Is anyone else having an issue compiling PAM on TORQUE with 2.5.8? >> >> Maybe another good question is "is anyone using PAM with 2.5.8"? >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From leggett at mcs.anl.gov Wed Oct 19 15:09:48 2011 From: leggett at mcs.anl.gov (Ti Leggett) Date: Wed, 19 Oct 2011 16:09:48 -0500 Subject: [torqueusers] Running torque with iptables Message-ID: <62D2F802-8A5B-496A-86C8-61A7505C5122@mcs.anl.gov> We're rolling out locking down machines much more tightly using iptables after a security incident. I've read the documentation and I have tcp/udp 15001 and tcp 15004 open on the PBS server, I have tcp 15002, tcp/udp 15003 and udp 0-1023 opened on the PBS MOMs and I have udp 0-1023 on the submit hosts. 
However it seems the MOM superior is trying to talk back to the submit host on tcp ephemeral ports >1024. Is there any way to restrict the range of those ports it's trying to use so that I can open those up appropriately, or am I going to have to take the (undesired) route of opening everything up between the MOMs and submit hosts?

From matt.morabito at gmail.com Wed Oct 19 15:23:37 2011
From: matt.morabito at gmail.com (Matthew Morabito)
Date: Wed, 19 Oct 2011 17:23:37 -0400
Subject: [torqueusers] Stacking jobs using Maui/Torque
Message-ID:

Crossposting this to torqueusers, I originally sent to mauiusers:

I'm using maui v. 3.3.1 with pbs/torque v.2.4.11 and it seems maui/torque
are refusing to share different jobs on the same nodes (from maui.log):

10/12 12:13:59 INFO: located resources for 64 tasks (76) in best partition DEFAULT for job 1519 at time 00:00:00
10/12 12:13:59 ERROR: invalid nodelist for job 1519:0 (inadequate nodecount, 29 < 32)
10/12 12:13:59 WARNING: cannot allocate tasks for job 1519 at 00:00:00

There are clearly enough processors (76) free for the job (which needs 64,
at 32 nodes/2 processors per node) and then maui decides that it can't
start despite having enough processors to do so. Where is the
configuration for allowing the stacking of jobs in maui, and is there
something I can change in the torque nodefile that can increase the max
jobs for each node?

My maui.cfg has:

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
RESERVATIONDEPTH = 20
NODEAVAILABILITYPOLICY UTILIZED
NODEACCESSPOLICY SHARED
NODEALLOCATIONPOLICY CPULOAD
#also has a line for each node:
NODECFG[node01] MAXJOB=2
... etc

My /var/lib/torque/server_priv/nodes file looks like (I have nodes with 2
and 4 processors):

node01 np=2
node02 np=2
....
node30 np=4
node31 np=4

Any help would be greatly appreciated.

Thanks,
Matthew

From knielson at adaptivecomputing.com Thu Oct 20 13:55:26 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 20 Oct 2011 13:55:26 -0600 (MDT)
Subject: [torqueusers] Running torque with iptables
In-Reply-To: <62D2F802-8A5B-496A-86C8-61A7505C5122@mcs.anl.gov>
Message-ID:

----- Original Message -----
> From: "Ti Leggett"
> To: "Torque Users Mailing List"
> Sent: Wednesday, October 19, 2011 3:09:48 PM
> Subject: [torqueusers] Running torque with iptables
>
> We're rolling out locking down machines much more tightly using
> iptables after a security incident. I've read the documentation and
> I have tcp/udp 15001 and tcp 15004 open on the PBS server, I have
> tcp 15002, tcp/udp 15003 and udp 0-1023 opened on the PBS MOMs and I
> have udp 0-1023 on the submit hosts. However it seems the MOM
> superior is trying to talk back to the submit host on tcp ephemeral
> ports >1024. Is there any way to restrict the range of those ports
> it's trying to use so that I can open those up appropriately, or am
> I going to have to take the (undesired) route of opening everything
> up between the MOMs and submit hosts?

The MOMs should only communicate with pbs_server and the other MOMs. I do not believe they communicate with the submit hosts. Could you tell us more about your setup?
Regards Ken From gabe at msi.umn.edu Thu Oct 20 13:58:14 2011 From: gabe at msi.umn.edu (Gabe Turner) Date: Thu, 20 Oct 2011 14:58:14 -0500 Subject: [torqueusers] Running torque with iptables In-Reply-To: References: <62D2F802-8A5B-496A-86C8-61A7505C5122@mcs.anl.gov> Message-ID: <20111020195814.GB26767@blackice.msi.umn.edu> On Thu, Oct 20, 2011 at 01:55:26PM -0600, Ken Nielson wrote: > The MOMs should only communicate with pbs_server and the other MOMs. I do > not believe they communicate with the submit hosts. Could you tell us > more about your setup? > I believe the moms do communicate with the submit host if you're running an interactive (qsub -I) job. At least that has been my experience. Thus we allow ports 1024 to 65536 between submit and compute. I'd also be interested in narrowing this range, if possible. -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From knielson at adaptivecomputing.com Thu Oct 20 14:14:50 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 20 Oct 2011 14:14:50 -0600 (MDT) Subject: [torqueusers] Running torque with iptables In-Reply-To: <20111020195814.GB26767@blackice.msi.umn.edu> Message-ID: ----- Original Message ----- > From: "Gabe Turner" > To: torqueusers at supercluster.org > Sent: Thursday, October 20, 2011 1:58:14 PM > Subject: Re: [torqueusers] Running torque with iptables > > On Thu, Oct 20, 2011 at 01:55:26PM -0600, Ken Nielson wrote: > > The MOMs should only communicate with pbs_server and the other > > MOMs. I do > > not believe they communicate with the submit hosts. Could you tell > > us > > more about your setup? > > > > I believe the moms do communicate with the submit host if you're > running an > interactive (qsub -I) job. At least that has been my experience. Thus > we > allow ports 1024 to 65536 between submit and compute. I'd also be > interested in narrowing this range, if possible. 
> > -- > Gabe Turner > gabe at msi.umn.edu > HPC Systems Administrator, > University of Minnesota > Supercomputing Institute > http://www.msi.umn.edu I had a feeling I would be wrong about that. Yes, the submit hosts to communicate with the MOMs on Interactive jobs. Ken From charles.johnson at accre.vanderbilt.edu Thu Oct 20 14:19:27 2011 From: charles.johnson at accre.vanderbilt.edu (Charles Johnson) Date: Thu, 20 Oct 2011 15:19:27 -0500 Subject: [torqueusers] mom ownership Message-ID: <1915A0BF-DFF4-461B-9E2F-6FEDC91C4AFE@accre.vanderbilt.edu> Are there circumstances in which a pbs_mom would be running as a user rather than root? ~Charles~ -- Charles Johnson, Vanderbilt University Advanced Computing Center for Research & Education From knielson at adaptivecomputing.com Thu Oct 20 14:24:16 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 20 Oct 2011 14:24:16 -0600 (MDT) Subject: [torqueusers] Running torque with iptables In-Reply-To: Message-ID: ----- Original Message ----- > From: "Ken Nielson" > To: "Torque Users Mailing List" > Sent: Thursday, October 20, 2011 2:14:50 PM > Subject: Re: [torqueusers] Running torque with iptables > > ----- Original Message ----- > > From: "Gabe Turner" > > To: torqueusers at supercluster.org > > Sent: Thursday, October 20, 2011 1:58:14 PM > > Subject: Re: [torqueusers] Running torque with iptables > > > > On Thu, Oct 20, 2011 at 01:55:26PM -0600, Ken Nielson wrote: > > > The MOMs should only communicate with pbs_server and the other > > > MOMs. I do > > > not believe they communicate with the submit hosts. Could you > > > tell > > > us > > > more about your setup? > > > > > > > I believe the moms do communicate with the submit host if you're > > running an > > interactive (qsub -I) job. At least that has been my experience. > > Thus > > we > > allow ports 1024 to 65536 between submit and compute. I'd also be > > interested in narrowing this range, if possible. 
> > > > -- > > Gabe Turner > > gabe at msi.umn.edu > > HPC Systems Administrator, > > University of Minnesota > > Supercomputing Institute > > http://www.msi.umn.edu > I just verified what ports interactive jobs were using from submit host to the MOM and they are dynamic. Ken From knielson at adaptivecomputing.com Thu Oct 20 14:25:48 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 20 Oct 2011 14:25:48 -0600 (MDT) Subject: [torqueusers] mom ownership In-Reply-To: <1915A0BF-DFF4-461B-9E2F-6FEDC91C4AFE@accre.vanderbilt.edu> Message-ID: <6d3019db-9481-4897-88af-ae45b345ef11@mail> The mom daemon should be running as root. When jobs are started by the mom the jobs run as the user. Ken ----- Original Message ----- > From: "Charles Johnson" > To: "Torque Users Mailing List" > Sent: Thursday, October 20, 2011 2:19:27 PM > Subject: [torqueusers] mom ownership > > Are there circumstances in which a pbs_mom would be running as a user > rather than root? > > ~Charles~ > -- > Charles Johnson, Vanderbilt University > Advanced Computing Center for Research & Education > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From charles.johnson at accre.vanderbilt.edu Thu Oct 20 14:28:04 2011 From: charles.johnson at accre.vanderbilt.edu (Charles Johnson) Date: Thu, 20 Oct 2011 15:28:04 -0500 Subject: [torqueusers] mom ownership In-Reply-To: <6d3019db-9481-4897-88af-ae45b345ef11@mail> References: <6d3019db-9481-4897-88af-ae45b345ef11@mail> Message-ID: <4E1BEAAD-1706-4F18-BAC3-2E4CA635E4F9@accre.vanderbilt.edu> On Oct 20, 2011, at 3:25 PM, Ken Nielson wrote: > The mom daemon should be running as root. When jobs are started by the mom the jobs run as the user. As I thought. We found nodes with pbs_mom's running as a user rather than root. Something for us to track down. Thanks, again. 
~Charles~ -- Charles Johnson, Vanderbilt University Advanced Computing Center for Research & Education From dbeer at adaptivecomputing.com Thu Oct 20 14:36:57 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 20 Oct 2011 14:36:57 -0600 (MDT) Subject: [torqueusers] mom ownership In-Reply-To: <4E1BEAAD-1706-4F18-BAC3-2E4CA635E4F9@accre.vanderbilt.edu> Message-ID: <67cf830e-f4a0-402a-b044-f984fa0922f0@mail> ----- Original Message ----- > On Oct 20, 2011, at 3:25 PM, Ken Nielson wrote: > > > The mom daemon should be running as root. When jobs are started by > > the mom the jobs run as the user. > > As I thought. We found nodes with pbs_mom's running as a user rather > than root. Something for us to track down. > > Thanks, again. > > ~Charles~ > -- > Charles Johnson, Vanderbilt University > Advanced Computing Center for Research & Education > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > Charles, What version of TORQUE are you running? Some have reported problems with TORQUE changing its id to a user and not changing back correctly. I believe that this problem was found in 2.5.6. I'm not the one who was working on this problem, so I'm not sure what the status of it is currently, but it could be you're experiencing a symptom of this bug. 
-- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From charles.johnson at accre.vanderbilt.edu Thu Oct 20 14:42:42 2011 From: charles.johnson at accre.vanderbilt.edu (Charles Johnson) Date: Thu, 20 Oct 2011 15:42:42 -0500 Subject: [torqueusers] mom ownership In-Reply-To: <67cf830e-f4a0-402a-b044-f984fa0922f0@mail> References: <67cf830e-f4a0-402a-b044-f984fa0922f0@mail> Message-ID: <22F469F2-712E-484B-B80F-948CEADC457D@accre.vanderbilt.edu> On Oct 20, 2011, at 3:36 PM, David Beer wrote: > What version of TORQUE are you running? We are running 2.5.8. ~Charles~ -- Charles Johnson, Vanderbilt University Advanced Computing Center for Research & Education Mailing Address: Peabody #34, 230 Appleton Place, Nashville, TN 37203 Shipping Address: 1231 18th Avenue South, Hill Center, Suite 143, Nashville, TN 37212 Office: 615-343-4134 Cell: 615-478-7788 Fax: 615-343-7216 charles.johnson at accre.vanderbilt.edu From sdowdy at ucar.edu Thu Oct 20 14:36:54 2011 From: sdowdy at ucar.edu (Stephen Dowdy) Date: Thu, 20 Oct 2011 14:36:54 -0600 Subject: [torqueusers] Running torque with iptables In-Reply-To: <20111020195814.GB26767@blackice.msi.umn.edu> References: <62D2F802-8A5B-496A-86C8-61A7505C5122@mcs.anl.gov> <20111020195814.GB26767@blackice.msi.umn.edu> Message-ID: <4EA08666.2040908@ucar.edu> Gabe Turner wrote, On 10/20/2011 01:58 PM: > On Thu, Oct 20, 2011 at 01:55:26PM -0600, Ken Nielson wrote: >> The MOMs should only communicate with pbs_server and the other MOMs. I do >> not believe they communicate with the submit hosts. Could you tell us >> more about your setup? >> > > I believe the moms do communicate with the submit host if you're running an > interactive (qsub -I) job. At least that has been my experience. Thus we > allow ports 1024 to 65536 between submit and compute. I'd also be > interested in narrowing this range, if possible. 
If it's indeed getting a port from the ephemeral pool, *something* like:

  EPHEMERAL_PORTS=$(sed 's/[[:space:]]\+/:/' /proc/sys/net/ipv4/ip_local_port_range)
  EPHEMERAL_PORTS=${EPHEMERAL_PORTS:-"32768:61000"}
  for host in ${SUBMIT_HOSTS}; do
    iptables -A DEFAULT -p tcp -s ${host} --sport ${EPHEMERAL_PORTS} --dport ${EPHEMERAL_PORTS} -j ACCEPT
  done

should do it. Narrowing the range like this doesn't help a whole lot, but it's probably worthwhile anyway. --stephen -- Stephen Dowdy - Systems Administrator - NCAR/RAL 303.497.2869 - sdowdy at ucar.edu - http://www.ral.ucar.edu/~sdowdy/
From mej at lbl.gov Thu Oct 20 19:39:04 2011 From: mej at lbl.gov (Michael Jennings) Date: Thu, 20 Oct 2011 18:39:04 -0700 Subject: [torqueusers] Running torque with iptables In-Reply-To: <62D2F802-8A5B-496A-86C8-61A7505C5122@mcs.anl.gov> References: <62D2F802-8A5B-496A-86C8-61A7505C5122@mcs.anl.gov> Message-ID: <20111021013904.GH12776@lbl.gov> On Wednesday, 19 October 2011, at 16:09:48 (-0500), Ti Leggett wrote: > We're rolling out locking down machines much more tightly using > iptables after a security incident. I've read the documentation and > I have tcp/udp 15001 and tcp 15004 open on the PBS server, I have > tcp 15002, tcp/udp 15003 and udp 0-1023 opened on the PBS MOMs and I > have udp 0-1023 on the submit hosts. However it seems the MOM > superior is trying to talk back to the submit host on tcp ephemeral > ports >1024. Is there any way to restrict the range of those ports > it's trying to use so that I can open those up appropriately, or am > I going to have to take the (undesired) route of opening everything > up between the MOMs and submit hosts?
In src/cmds/qsub.c, function interactive_port(), the following code determines that the port number will be arbitrary for the interactive job listener:

  myaddr.sin_port = 0;

Two possible solutions here: If you know only 1 qsub -I will ever be running on a particular node at any one time, you can hardcode the port here by changing 0 to htons(12345) (or whatever port number you choose). The better solution is going to wrap the bind() in a for loop to try a range of port numbers consecutively until the bind() succeeds (or you run out of ports).

  for (port = LOW_PORT; port <= HIGH_PORT; port++) {
      myaddr.sin_port = htons(port);
      if (bind(*sock, (struct sockaddr *)&myaddr, namelen) >= 0) {
          break;
      }
  }
  if (port > HIGH_PORT) {
      perror("qsub: unable to bind to socket");
      exit(1);
  }

Something like that. If that works, you (or someone) might be inclined to add a configuration option to specify the port range. :-) HTH, Michael -- Michael Jennings Linux Systems and Cluster Engineer High-Performance Computing Services Bldg 50B-3209E W: 510-495-2687 MS 050C-3396 F: 510-486-8615
From glen.beane at gmail.com Thu Oct 20 20:04:40 2011 From: glen.beane at gmail.com (Glen Beane) Date: Thu, 20 Oct 2011 22:04:40 -0400 Subject: [torqueusers] mom ownership In-Reply-To: <1915A0BF-DFF4-461B-9E2F-6FEDC91C4AFE@accre.vanderbilt.edu> References: <1915A0BF-DFF4-461B-9E2F-6FEDC91C4AFE@accre.vanderbilt.edu> Message-ID: On Thu, Oct 20, 2011 at 4:19 PM, Charles Johnson wrote: > Are there circumstances in which a pbs_mom would be running as a user rather than root? yes. pbs_mom occasionally forks processes that take on the uid/gid of a user.
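Editorial sketch for the "Running torque with iptables" thread: the bind() loop that Michael Jennings outlines for the interactive job listener can be tried outside of qsub with a small standalone helper along these lines. This is only an illustration under stated assumptions, not the TORQUE code: bind_in_range() and its low/high parameters are hypothetical names, and the real interactive_port() in src/cmds/qsub.c has its own variables and error handling.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of TORQUE): bind an existing TCP socket to
 * the first free port in [low, high] on all local interfaces.
 * Returns the port that was bound, or -1 if every port in the range failed.
 */
int bind_in_range(int sock, int low, int high)
{
    struct sockaddr_in addr;
    int port;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    for (port = low; port <= high; port++) {
        addr.sin_port = htons((unsigned short)port);
        /* bind() returns 0 on success; on failure try the next port */
        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return port;
    }
    return -1;
}
```

If something like this were patched into qsub, the chosen range would probably be the only TCP range that has to be opened from the MOMs to the submit hosts for interactive jobs. The range would need at least as many ports as the number of concurrent qsub -I sessions expected on one submit host.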
From knielson at adaptivecomputing.com Thu Oct 20 20:04:48 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 20 Oct 2011 20:04:48 -0600 (MDT) Subject: [torqueusers] Running torque with iptables In-Reply-To: <20111021013904.GH12776@lbl.gov> Message-ID: <1210c46c-1073-4ae6-91aa-1d19a0423510@mail> ---- Original Message ----- > From: "Michael Jennings" > To: torqueusers at supercluster.org > Sent: Thursday, October 20, 2011 7:39:04 PM > Subject: Re: [torqueusers] Running torque with iptables > > On Wednesday, 19 October 2011, at 16:09:48 (-0500), > Ti Leggett wrote: > > > We're rolling out locking down machines much more tightly using > > iptables after a security incident. I've read the documentation and > > I have tcp/udp 15001 and tcp 15004 open on the PBS server, I have > > tcp 15002, tcp/udp 15003 and udp 0-1023 opened on the PBS MOMs and > > I > > have udp 0-1023 on the submit hosts. However it seems the MOM > > superior is trying to talk back to the submit host on tcp ephemeral > > ports 1024. Is there any way to restrict the range of those ports > > it's trying to use so that I can open those up appropriately, or am > > I going to have to take the (undesired) route of opening everything > > up between the MOMs and submit hosts? > > In src/cmds/qsub.c, function interactive_port(), the following code > determines that the port number will be arbitrary for the interactive > job listener: > > myaddr.sin_port = 0; > > Two possible solutions here: If you know only 1 qsub -I will ever be > running on a particular node at any one time, you can hardcode the > port here by changing 0 to htons(12345) (or whatever port number you > choose). > > The better solution is going to wrap the bind() in a for loop to try > a > range of port numbers consecutively until the bind() succeeds (or you > run out of ports). 
> > for (port = LOW_PORT; port <= HIGH_PORT; port++) { > myaddr.sin_port = htons(port); > if (bind(*sock, (struct sockaddr *)&myaddr, namelen) >= 0) { > break; > } > } > if (port > HIGH_PORT) { > perror("qsub: unable to bind to socket"); > exit(1); > } > > Something like that. > > If that works, you (or someone) might be inclined to add a > configuration option to specify the port range. :-) I vote it goes in the torque.cfg. Ken From glen.beane at gmail.com Thu Oct 20 20:10:08 2011 From: glen.beane at gmail.com (Glen Beane) Date: Thu, 20 Oct 2011 22:10:08 -0400 Subject: [torqueusers] mom ownership In-Reply-To: <67cf830e-f4a0-402a-b044-f984fa0922f0@mail> References: <4E1BEAAD-1706-4F18-BAC3-2E4CA635E4F9@accre.vanderbilt.edu> <67cf830e-f4a0-402a-b044-f984fa0922f0@mail> Message-ID: On Thu, Oct 20, 2011 at 4:36 PM, David Beer wrote: > > ----- Original Message ----- >> On Oct 20, 2011, at 3:25 PM, Ken Nielson wrote: >> >> > The mom daemon should be running as root. When jobs are started by >> > the mom the jobs run as the user. >> >> As I thought. We found nodes with pbs_mom's running as a user rather >> than root. Something for us to track down. >> >> Thanks, again. >> >> ~Charles~ >> -- >> Charles Johnson, Vanderbilt University >> Advanced Computing Center for Research & Education >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > Charles, > > What version of TORQUE are you running? Some have reported problems with TORQUE changing its id to a user and not changing back correctly. I believe that this problem was found in 2.5.6. I'm not the one who was working on this problem, so I'm not sure what the status of it is currently, but it could be you're experiencing a symptom of this bug. 
I'm not sure if anyone is working on that right now -- after talking to Ken and looking at the code some more, I am not sure my problem is what I originally thought... but back to the original post, it is not necessarily indicative of a problem if you notice pbs_mom processes running as a user other than root (these should be child processes of the root pbs_mom process). This is sometimes done to copy or delete certain files. If the original poster has problems with the main pbs_mom thread changing its UID, and ends up with lots of "permission denied" type errors in syslog, then maybe there is something to the bug I thought I saw... From leggett at mcs.anl.gov Fri Oct 21 08:53:15 2011 From: leggett at mcs.anl.gov (Ti Leggett) Date: Fri, 21 Oct 2011 09:53:15 -0500 Subject: [torqueusers] Running torque with iptables In-Reply-To: References: Message-ID: <221C17DB-25B6-4649-A62B-74E1BEB439E5@mcs.anl.gov> What's the proper way to make a feature request to have a runtime configuration option for the port ranges used? On Oct 20, 2011, at 2:55 PM, Ken Nielson wrote: > ----- Original Message ----- >> From: "Ti Leggett" >> To: "Torque Users Mailing List" >> Sent: Wednesday, October 19, 2011 3:09:48 PM >> Subject: [torqueusers] Running torque with iptables >> >> We're rolling out locking down machines much more tightly using >> iptables after a security incident. I've read the documentation and >> I have tcp/udp 15001 and tcp 15004 open on the PBS server, I have >> tcp 15002, tcp/udp 15003 and udp 0-1023 opened on the PBS MOMs and I >> have udp 0-1023 on the submit hosts. However it seems the MOM >> superior is trying to talk back to the submit host on tcp ephemeral >> ports >1024. Is there any way to restrict the range of those ports >> it's trying to use so that I can open those up appropriately, or am >> I going to have to take the (undesired) route of opening everything >> up between the MOMs and submit hosts?
> > The MOMs should only communicate with pbs_server and the other MOMs. I do not believe they communicate with the submit hosts. Could you tell us more about your setup? > > Regards > > Ken > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 163 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111021/c4528df2/attachment.bin From knielson at adaptivecomputing.com Fri Oct 21 11:30:40 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 21 Oct 2011 11:30:40 -0600 (MDT) Subject: [torqueusers] Running torque with iptables In-Reply-To: <221C17DB-25B6-4649-A62B-74E1BEB439E5@mcs.anl.gov> Message-ID: <97e17e02-8bf7-44c0-9902-c4ecd2299961@mail> ----- Original Message ----- > From: "Ti Leggett" > To: "Torque Users Mailing List" > Sent: Friday, October 21, 2011 8:53:15 AM > Subject: Re: [torqueusers] Running torque with iptables > > What's the proper way to make a feature request to have a runtime > configuration option for the port ranges used? Create a bugzilla entry for it. www.clusterresources.com/bugzilla Ken From halmabrazi at idtdna.com Fri Oct 21 13:57:43 2011 From: halmabrazi at idtdna.com (Hakeem Almabrazi) Date: Fri, 21 Oct 2011 19:57:43 +0000 Subject: [torqueusers] job exceeds queue ... message! Message-ID: Hi, I keep getting this message when I submit a job " qsub: job exceeds queue resource limits MSG=cannot locate feasible nodes " The submitted job #PBS -N name #PBS -l nodes=2:ppn=1 #PBS -q bio .... Here is the contents of my nodes file Node1 np=1 Node2 np=1 Node3 np=1 Node4 np=1 Any idea what I could be missing here? Regards, Hak -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111021/a29b404a/attachment-0001.html From leggett at mcs.anl.gov Fri Oct 21 14:02:00 2011 From: leggett at mcs.anl.gov (Ti Leggett) Date: Fri, 21 Oct 2011 15:02:00 -0500 Subject: [torqueusers] Running torque with iptables In-Reply-To: <97e17e02-8bf7-44c0-9902-c4ecd2299961@mail> References: <97e17e02-8bf7-44c0-9902-c4ecd2299961@mail> Message-ID: Done. Bug 162. On Oct 21, 2011, at 12:30 PM, Ken Nielson wrote: > ----- Original Message ----- >> From: "Ti Leggett" >> To: "Torque Users Mailing List" >> Sent: Friday, October 21, 2011 8:53:15 AM >> Subject: Re: [torqueusers] Running torque with iptables >> >> What's the proper way to make a feature request to have a runtime >> configuration option for the port ranges used? > > Create a bugzilla entry for it. www.clusterresources.com/bugzilla > > Ken > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 163 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111021/dcfe776a/attachment.bin From j.kasiak at gmail.com Sat Oct 22 17:10:00 2011 From: j.kasiak at gmail.com (Jan Kasiak) Date: Sat, 22 Oct 2011 19:10:00 -0400 Subject: [torqueusers] Queue Node Type and ppn Message-ID: Hello Everyone, I'm using torque-3.0.0 and maui-3.3_pbs I have searched far and wide for a solution to this problem..and I can't find out how to set this up. I have 3 node types: (13) * p5300, (16) * p5400 and (39) * towel (set in my nodes file): towel01 np=12 towel ... p530001 np=8 p5300 ... p540001 np=8 p5400 ... I want to prevent users from mixing node types for their jobs. 
I want to set up 3 queues, one for each node type, in such a way that if you submit to queue p5300, it will error for the following: qsub -I -q p5300 -lnodes=14:ppn=8 (more than 13 nodes) qsub -I -q p5300 -lnodes=13:ppn=9 (ppn greater than available for p5300) qsub -I -q p5300 -lnodes=2:ppn=9 (ppn greater than available for p5300) Is this possible to set up? Thanks, -Jan Kasiak From lance at quantumbioinc.com Fri Oct 21 13:37:37 2011 From: lance at quantumbioinc.com (Lance Westerhoff) Date: Fri, 21 Oct 2011 15:37:37 -0400 Subject: [torqueusers] torque/maui disregarding pmem with procs Message-ID: <8B92FEF2-6148-40BE-B233-2A11D6D2AF08@quantumbioinc.com> Hello all- We are trying to use procs and pmem on an 18 node cluster with nodes of various memory size. pbsnodes shows the correct memory complement for each node, so apparently PBS is getting the right specs (see the output of pbsnodes below for more information). If we use the following settings in the PBS script, invariably PBS will try to fill up the all 8 of the 8 cores of each node. That is even though there is no where near enough memory on any of these nodes for 8*3700mb=29600mb. Considering the physical memory limit goes from 8GB to 24GB depending upon the node, this is just taking down nodes left and right. Below I have provided an example along with the associated output. I also provided the output for pbsnodes in case there is something I am missing here. Thanks for your help! 
torque version: tried 2.5.4, 2.5.8, and 3.0.2 maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) $ cat tmp.pbs #!/bin/bash #PBS -S /bin/bash #PBS -l procs=24 #PBS -l pmem=3700mb #PBS -l walltime=6:00:00 #PBS -j oe cat $PBS_NODEFILE $ qsub tmp.pbs 337003.XXXX $ wc -l tmp.pbs.o337003 24 tmp.pbs.o337003 $ cat tmp.pbs.o337003 compute-0-14 compute-0-14 compute-0-14 compute-0-14 compute-0-14 compute-0-14 compute-0-14 compute-0-14 compute-0-15 compute-0-15 compute-0-15 compute-0-15 compute-0-15 compute-0-15 compute-0-15 compute-0-15 compute-0-16 compute-0-16 compute-0-16 compute-0-16 compute-0-16 compute-0-16 compute-0-16 compute-0-16 $ pbsnodes -a compute-0-16 state = free np = 8 ntype = cluster status = rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=10225576kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-15 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=10225576kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-14 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=10225576kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 
0,uname=Linux compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-13 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=10225576kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-12 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-11 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-9 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,loadave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-8 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 
0,uname=Linux compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-7 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-6 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-5 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-4 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,loadave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-3 state = free np = 8 properties = lustre ntype = cluster status = rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=26736488kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 
0,uname=Linux compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-2 state = free np = 8 properties = lustre ntype = cluster status = rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-1 state = free np = 8 properties = lustre ntype = cluster status = rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-0 state = free np = 8 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-10 state = free np = 2 ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,loadave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=14350232kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0 compute-0-17 state = free np = 8 properties = testbox ntype = cluster status = rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576kb,idletime=5558,nusers=0,nsessions=? 0,sessions=? 
0,uname=Linux compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux gpus = 0
From knielson at adaptivecomputing.com Mon Oct 24 09:29:31 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 24 Oct 2011 09:29:31 -0600 (MDT) Subject: [torqueusers] 2.5.9 snapshot available In-Reply-To: <05054870-a809-4f73-abe4-2c65dbe7bda9@mail> Message-ID: <02103367-0447-4c9b-9bc9-27558a7fab36@mail> Hi all, There is a TORQUE snapshot available to try. This snapshot fixes a problem with PAM modules where getpwnam_ext could not be resolved. It also fixes a problem with job array dependencies where afteranyarray and afterokarry would not allow jobs to run when the dependency was satisfied on the array. The file can be downloaded from http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.9-snap.201110240923.tar.gz Please download this and try it. Let us know if it does not fix the problems described above or if it gives you any other problems. Regards Ken Nielson Adaptive Computing From lloyd_brown at byu.edu Mon Oct 24 10:53:14 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Mon, 24 Oct 2011 10:53:14 -0600 Subject: [torqueusers] 2.5.9 snapshot available In-Reply-To: <02103367-0447-4c9b-9bc9-27558a7fab36@mail> References: <02103367-0447-4c9b-9bc9-27558a7fab36@mail> Message-ID: <4EA597FA.8090609@byu.edu> From what I can see, the issue with the PAM module, and getpwnam_ext, seems to be fixed in this snapshot. Thank you. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 10/24/2011 09:29 AM, Ken Nielson wrote: > Hi all, > > There is a TORQUE snapshot available to try. This snapshot fixes a problem with PAM modules where getpwnam_ext could not be resolved.
> > It also fixes a problem with job array dependencies where afteranyarray and afterokarry would not allow jobs to run when the dependency was satisfied on the array. > > The file can be downloaded from http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.9-snap.201110240923.tar.gz > > Please download this and try it. Let us know if it does not fix the problems described above or if it gives you any other problems. > > Regards > > Ken Nielson > Adaptive Computing > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Gareth.Williams at csiro.au Mon Oct 24 14:32:01 2011 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Tue, 25 Oct 2011 07:32:01 +1100 Subject: [torqueusers] Queue Node Type and ppn In-Reply-To: References: Message-ID: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Jan Kasiak [mailto:j.kasiak at gmail.com] > Sent: Sunday, 23 October 2011 10:10 AM > To: torqueusers at supercluster.org > Subject: [torqueusers] Queue Node Type and ppn > > Hello Everyone, > > I'm using torque-3.0.0 and maui-3.3_pbs > I have searched far and wide for a solution to this problem..and I > can't find out how to set this up. > I have 3 node types: (13) * p5300, (16) * p5400 and (39) * towel (set > in my nodes file): > > towel01 np=12 towel > ... > p530001 np=8 p5300 > ... > p540001 np=8 p5400 > ... > > I want to prevent users from mixing node types for their jobs. > > I want to set up 3 queues, one for each node type, in such a way that > if you submit to queue p5300, it will error for the following: > qsub -I -q p5300 -lnodes=14:ppn=8 (more than 13 nodes) > qsub -I -q p5300 -lnodes=13:ppn=9 (ppn greater than available for > p5300) > qsub -I -q p5300 -lnodes=2:ppn=9 (ppn greater than available for p5300) > > Is this possible to set up? 
> > Thanks, > -Jan Kasiak Hi Jan, I'll stick my neck out and state that the only way to reject jobs based on the nodes and ppn as you would like is to use a submit filter. The rest can be done with reservations for each type of nodes and queues/classes that can access those reservations, or maybe just with types and defaults on queues. Gareth From j.kasiak at gmail.com Mon Oct 24 15:23:19 2011 From: j.kasiak at gmail.com (Jan Kasiak) Date: Mon, 24 Oct 2011 17:23:19 -0400 Subject: [torqueusers] Queue Node Type and ppn In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au> Message-ID: Gareth, I had started to write my own wrapper to the qsub binary. I didnt know that torque had submit filter support. I will proceed with this option and see how it goes. Many Thanks, -Jan On Oct 24, 2011 4:32 PM, wrote: > > -----Original Message----- > > From: Jan Kasiak [mailto:j.kasiak at gmail.com] > > Sent: Sunday, 23 October 2011 10:10 AM > > To: torqueusers at supercluster.org > > Subject: [torqueusers] Queue Node Type and ppn > > > > Hello Everyone, > > > > I'm using torque-3.0.0 and maui-3.3_pbs > > I have searched far and wide for a solution to this problem..and I > > can't find out how to set this up. > > I have 3 node types: (13) * p5300, (16) * p5400 and (39) * towel (set > > in my nodes file): > > > > towel01 np=12 towel > > ... > > p530001 np=8 p5300 > > ... > > p540001 np=8 p5400 > > ... > > > > I want to prevent users from mixing node types for their jobs. > > > > I want to set up 3 queues, one for each node type, in such a way that > > if you submit to queue p5300, it will error for the following: > > qsub -I -q p5300 -lnodes=14:ppn=8 (more than 13 nodes) > > qsub -I -q p5300 -lnodes=13:ppn=9 (ppn greater than available for > > p5300) > > qsub -I -q p5300 -lnodes=2:ppn=9 (ppn greater than available for p5300) > > > > Is this possible to set up? 
> > > > Thanks, > > -Jan Kasiak > > Hi Jan, > > I'll stick my neck out and state that the only way to reject jobs based on > the nodes and ppn as you would like is to use a submit filter. > > The rest can be done with reservations for each type of nodes and > queues/classes that can access those reservations, or maybe just with types > and defaults on queues. > > Gareth > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111024/85cdafea/attachment.html From abraham.zamudio at gmail.com Mon Oct 24 13:59:54 2011 From: abraham.zamudio at gmail.com (Abraham Zamudio) Date: Mon, 24 Oct 2011 14:59:54 -0500 Subject: [torqueusers] Various Serials job In-Reply-To: References: Message-ID: Hi everybody , I try send a varios jobs (serial program with diferent inputs and different outputs) , I am looking for which is the best way ? . Thx . -- Abraham Zamudio Ch. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111024/a7de1979/attachment.html From abraham.zamudio at gmail.com Mon Oct 24 15:13:54 2011 From: abraham.zamudio at gmail.com (Abraham Zamudio) Date: Mon, 24 Oct 2011 16:13:54 -0500 Subject: [torqueusers] Problem with pbsdsh Message-ID: Hi people , I have a following problem . 
I am trying to run several copies of the following code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <gsl/gsl_matrix.h>

int main (int argc, char **argv)
{
  int i,j;
  int n,m;

  n=10000;
  m=10000;

  gsl_matrix * A = gsl_matrix_alloc(n,m);

  struct timeval tval;
  gettimeofday(&tval, 0);
  long int NN = (tval.tv_sec ^ tval.tv_usec) ^ getpid() ;
  srand(NN);

  for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
      gsl_matrix_set (A, i, j, rand());

  FILE * f = fopen(argv[1],"wb");
  gsl_matrix_fwrite (f, A);

  fclose (f);
  gsl_matrix_free (A);

  return 0;
}

The compilation is with:

gcc -I/usr/include/gsl -Wall -pedantic -ggdb -std=c99 -lgsl -lgslcblas -o matrix matrix.c

Basically, what I want is to generate an output file (matrix_$PBS_VNODENUM.dat) for each processor.

My qsub file:

#PBS -S /bin/bash
#PBS -V
#PBS -N matrix2
#PBS -q batch
#PBS -l nodes=quad4:ppn=4
##### #PBS -j oe
#PBS -e matrix_$PBS_JOBID.err
#PBS -o matrix_$PBS_JOBID.out

pbsdsh -v $PBS_O_WORKDIR/matrix2 $PBS_O_WORKDIR/matrix_$PBS_VNODENUM.dat

The problem is that only one output file (matrix_0.dat) is stored. I'm looking for some advice.

--
Abraham Zamudio Ch.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111024/cfd12275/attachment.html

From jjc at iastate.edu Tue Oct 25 09:16:47 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 25 Oct 2011 10:16:47 -0500
Subject: [torqueusers] Problem with pbsdsh
In-Reply-To:
References:
Message-ID:

Abraham,

I think $PBS_VNODENUM only gets a number inside a copy of a script launched by pbsdsh.

This is similar to the value of
  n
outside of the loop
  for(n=0;n<4;n++){
  }

Try launching a script which uses
  matrix_$PBS_VNODENUM.dat
internally.

E.g.

script1 is executable in $PBS_O_WORKDIR and contains:

#!/bin/bash

./matrix2 matrix_$PBS_VNODENUM.dat

and change the pbsdsh command in your script to the two lines:

cd $PBS_O_WORKDIR
pbsdsh script1

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Abraham Zamudio Sent: Monday, October 24, 2011 4:14 PM To: Torque Users Mailing List; David Beer; Ken Nielson; tbaer at utk.edu Subject: [torqueusers] Problem with pbsdsh Hi people , I have a following problem . I am trying run various copies of the following code : #include #include #include #include #include int main (int argc, char **argv) { int i,j; int n,m; n=10000; m=10000; gsl_matrix * A = gsl_matrix_alloc(n,m); struct timeval tval; gettimeofday(&tval, 0); long int NN = (tval.tv_sec ^ tval.tv_usec) ^ getpid() ; srand(NN); for (i = 0; i < n; i++) for (j = 0; j < m; j++) gsl_matrix_set (A, i, j, rand()); FILE * f = fopen(argv[1],"wb"); gsl_matrix_fwrite (f, A); fclose (f); gsl_matrix_free (A); return 0; } The compilation is with : gcc -I/usr/include/gsl -Wall -pedantic -ggdb -std=c99 -lgsl -lgslcblas -o matrix matrix.c Basically what i want is to generate an output file (matrix_$PBS_VNODENUM.dat) for each processor . My qsub file : #PBS -S /bin/bash #PBS -V #PBS -N matrix2 #PBS -q batch #PBS -l nodes=quad4:ppn=4 ##### #PBS -j oe #PBS -e matrix_$PBS_JOBID.err #PBS -o matrix_$PBS_JOBID.out pbsdsh -v $PBS_O_WORKDIR/matrix2 $PBS_O_WORKDIR/matrix_$PBS_VNODENUM.dat The problem is that only stores an output file (matrix_0.dat) . I'm looking for some advice . -- Abraham Zamudio Ch. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111025/cebe0f5a/attachment-0001.html From abraham.zamudio at gmail.com Tue Oct 25 09:51:45 2011 From: abraham.zamudio at gmail.com (Abraham Zamudio) Date: Tue, 25 Oct 2011 10:51:45 -0500 Subject: [torqueusers] Problem with pbsdsh In-Reply-To: References: Message-ID: Dr. 
Coyle,

Now I have the following problem:

PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
pbsdsh: error 17000 on spawn
PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
pbsdsh: error 17000 on spawn
PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
pbsdsh: error 17000 on spawn
PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
pbsdsh: error 17000 on spawn

I do not understand why I have no permissions.

My qsub file:

#PBS -S /bin/bash
#PBS -V
#PBS -N matrix2
#PBS -q batch
#PBS -l nodes=quad4:ppn=4
##### #PBS -j oe
#PBS -e matrix_$PBS_JOBID.err
#PBS -o matrix_$PBS_JOBID.out

pbsdsh $PBS_O_WORKDIR/script1

And the script1 file:

#!/bin/bash

$PBS_O_WORKDIR/matrix2 $PBS_O_WORKDIR/matrix_$PBS_VNODENUM.dat

On Tue, Oct 25, 2011 at 10:16 AM, Coyle, James J [ITACD] wrote:

> Abraham,
>
> I think $PBS_VNODENUM only gets a number inside a copy of
> a script launched by pbsdsh.
>
> This is similar to the value of
> n
> outside of the loop
> for(n=0;n<4;n++){
> }
>
> Try launching a script which uses
> matrix_$PBS_VNODENUM.dat
> internally.
>
> E.g.
>
> script1 is executable in $PBS_O_WORKDIR and contains:
>
> #!/bin/bash
>
> ./matrix2 matrix_$PBS_VNODENUM.dat
>
> and change the pbsdsh command in your script to the two lines:
>
> cd $PBS_O_WORKDIR
> pbsdsh script1
>
> James Coyle, PhD
> High Performance Computing Group
> Iowa State Univ.
**** > > web: http://jjc.public.iastate.edu/ > **** > > ** ** > > ** ** > > ** ** > > ** ** > > ** ** > > *From:* torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] *On Behalf Of *Abraham Zamudio > *Sent:* Monday, October 24, 2011 4:14 PM > *To:* Torque Users Mailing List; David Beer; Ken Nielson; tbaer at utk.edu > *Subject:* [torqueusers] Problem with pbsdsh**** > > ** ** > > Hi people , > > I have a following problem . I am trying run various copies of the > following code : > > #include > #include > #include > #include > #include > > int main (int argc, char **argv) > { > int i,j; > int n,m; > > n=10000; > m=10000; > > gsl_matrix * A = gsl_matrix_alloc(n,m); > > struct timeval tval; > gettimeofday(&tval, 0); > long int NN = (tval.tv_sec ^ tval.tv_usec) ^ getpid() ; > srand(NN); > > > for (i = 0; i < n; i++) > for (j = 0; j < m; j++) > gsl_matrix_set (A, i, j, rand()); > > FILE * f = fopen(argv[1],"wb"); > gsl_matrix_fwrite (f, A); > > fclose (f); > gsl_matrix_free (A); > > return 0; > } > > > The compilation is with : > > gcc -I/usr/include/gsl -Wall -pedantic -ggdb -std=c99 -lgsl -lgslcblas -o > matrix matrix.c > > Basically what i want is to generate an output file > (matrix_$PBS_VNODENUM.dat) for each processor . > > My qsub file : > > #PBS -S /bin/bash > #PBS -V > #PBS -N matrix2 > #PBS -q batch > #PBS -l nodes=quad4:ppn=4 > ##### #PBS -j oe > #PBS -e matrix_$PBS_JOBID.err > #PBS -o matrix_$PBS_JOBID.out > > pbsdsh -v $PBS_O_WORKDIR/matrix2 $PBS_O_WORKDIR/matrix_$PBS_VNODENUM.dat > > The problem is that only stores an output file (matrix_0.dat) . I'm looking > for some advice . > > > -- > Abraham Zamudio Ch.**** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- Abraham Zamudio Ch. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111025/1389493d/attachment.html

From abraham.zamudio at gmail.com Tue Oct 25 09:58:01 2011
From: abraham.zamudio at gmail.com (Abraham Zamudio)
Date: Tue, 25 Oct 2011 10:58:01 -0500
Subject: [torqueusers] Problem with pbsdsh
In-Reply-To:
References:
Message-ID:

The ID of this job is 240, and this is the torque log:

cat /var/spool/torque/server_logs/20111025 | grep 240
10/25/2011 10:47:04;0100;PBS_Server;Job;240.master;enqueuing into batch, state 1 hop 1
10/25/2011 10:47:04;0008;PBS_Server;Job;240.master;Job Queued at request of mpiX at master, owner = mpiX at master, job name = matrix2, queue = batch
10/25/2011 10:47:05;0008;PBS_Server;Job;240.master;Job Run at request of root at master
10/25/2011 10:47:05;000d;PBS_Server;Job;240.master;Not sending email: User does not want mail of this type.
10/25/2011 10:47:05;000d;PBS_Server;Job;240.master;Not sending email: User does not want mail of this type.
10/25/2011 10:47:05;0010;PBS_Server;Job;240.master;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3096kb resources_used.vmem=225088kb resources_used.walltime=00:00:00
10/25/2011 10:47:05;0100;PBS_Server;Job;240.master;dequeuing from batch, state COMPLETE

On Tue, Oct 25, 2011 at 10:51 AM, Abraham Zamudio wrote:

> Dr. Coyle,
>
> Now I have the following problem:
>
> PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
> pbsdsh: error 17000 on spawn
> PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
> pbsdsh: error 17000 on spawn
> PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
> pbsdsh: error 17000 on spawn
> PBS: /jro_cluster/mpiX/Matrix/script1: Permission denied
> pbsdsh: error 17000 on spawn
>
> I do not understand why I have no permissions.
> > My qsub file : > > #PBS -S /bin/bash > #PBS -V > #PBS -N matrix2 > #PBS -q batch > #PBS -l nodes=quad4:ppn=4 > ##### #PBS -j oe > #PBS -e matrix_$PBS_JOBID.err > #PBS -o matrix_$PBS_JOBID.out > > pbsdsh $PBS_O_WORKDIR/script1 > > > And the script1 file : > > #!/bin/bash > > $PBS_O_WORKDIR/matrix2 $PBS_O_WORKDIR/matrix_$PBS_VNODENUM.dat > > > > > On Tue, Oct 25, 2011 at 10:16 AM, Coyle, James J [ITACD] wrote: > >> Abraham,**** >> >> ** ** >> >> I think $PBS_VNODENUM only gets a number inside a copy of**** >> >> a script launched by pbsdsh.**** >> >> ** ** >> >> This is similar to the value of **** >> >> n **** >> >> outside of the loop **** >> >> for(n=0;n<4;n++){**** >> >> }**** >> >> ** ** >> >> ** ** >> >> ** ** >> >> Try launching a script which uses **** >> >> matrix_$PBS_VNODENUM.dat**** >> >> ** ** >> >> internally.**** >> >> ** ** >> >> ** ** >> >> E.g.**** >> >> ** ** >> >> script1 is executable in $PBS_O_WORKDIR and contains:**** >> >> ** ** >> >> #!/bin/bash**** >> >> ** ** >> >> ./ matrix2 matrix_$PBS_VNODENUM.dat**** >> >> ** ** >> >> and change the pbsdsh command in your script to the two lines:**** >> >> ** ** >> >> cd $PBS_O_WORKDIR**** >> >> pbsdsh script1**** >> >> ** ** >> >> ** ** >> >> James Coyle, PhD**** >> >> High Performance Computing Group **** >> >> Iowa State Univ. **** >> >> web: http://jjc.public.iastate.edu/ >> **** >> >> ** ** >> >> ** ** >> >> ** ** >> >> ** ** >> >> ** ** >> >> *From:* torqueusers-bounces at supercluster.org [mailto: >> torqueusers-bounces at supercluster.org] *On Behalf Of *Abraham Zamudio >> *Sent:* Monday, October 24, 2011 4:14 PM >> *To:* Torque Users Mailing List; David Beer; Ken Nielson; tbaer at utk.edu >> *Subject:* [torqueusers] Problem with pbsdsh**** >> >> ** ** >> >> Hi people , >> >> I have a following problem . 
I am trying run various copies of the >> following code : >> >> #include >> #include >> #include >> #include >> #include >> >> int main (int argc, char **argv) >> { >> int i,j; >> int n,m; >> >> n=10000; >> m=10000; >> >> gsl_matrix * A = gsl_matrix_alloc(n,m); >> >> struct timeval tval; >> gettimeofday(&tval, 0); >> long int NN = (tval.tv_sec ^ tval.tv_usec) ^ getpid() ; >> srand(NN); >> >> >> for (i = 0; i < n; i++) >> for (j = 0; j < m; j++) >> gsl_matrix_set (A, i, j, rand()); >> >> FILE * f = fopen(argv[1],"wb"); >> gsl_matrix_fwrite (f, A); >> >> fclose (f); >> gsl_matrix_free (A); >> >> return 0; >> } >> >> >> The compilation is with : >> >> gcc -I/usr/include/gsl -Wall -pedantic -ggdb -std=c99 -lgsl -lgslcblas -o >> matrix matrix.c >> >> Basically what i want is to generate an output file >> (matrix_$PBS_VNODENUM.dat) for each processor . >> >> My qsub file : >> >> #PBS -S /bin/bash >> #PBS -V >> #PBS -N matrix2 >> #PBS -q batch >> #PBS -l nodes=quad4:ppn=4 >> ##### #PBS -j oe >> #PBS -e matrix_$PBS_JOBID.err >> #PBS -o matrix_$PBS_JOBID.out >> >> pbsdsh -v $PBS_O_WORKDIR/matrix2 $PBS_O_WORKDIR/matrix_$PBS_VNODENUM.dat >> >> The problem is that only stores an output file (matrix_0.dat) . I'm >> looking for some advice . >> >> >> -- >> Abraham Zamudio Ch.**** >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > > > -- > Abraham Zamudio Ch. > > -- Abraham Zamudio Ch. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111025/1a299a7d/attachment-0001.html From torsten at synapse.sri.com Tue Oct 25 11:52:25 2011 From: torsten at synapse.sri.com (Torsten Rohlfing) Date: Tue, 25 Oct 2011 10:52:25 -0700 Subject: [torqueusers] User name wildcard not working for "authorized_users"? Message-ID: <4EA6F759.7080901@synapse.sri.com> Greetings! 
According to

http://www.adaptivecomputing.com/resources/docs/torque/1.3advconfig.php#jobsubmithosts

the "authorized_users" server option "can use the '*' wildcard to allow multiple names to be accepted with a single entry" for both the user and the host portion.

I am trying this with torque 3.0.2, and while the wildcard works for the host portion, it doesn't seem to work for the user portion. That is, if I want to allow a user "torsten at axon.cluster", then I can do this with "torsten@*.cluster", but "*@axon.cluster" will not work.

Am I missing something or is this potentially a bug?

Many thanks!
Torsten

--
Torsten Rohlfing, PhD          SRI International, Neuroscience Program
Senior Research Scientist      333 Ravenswood Ave, Menlo Park, CA 94025
Phone: ++1 (650) 859-3379      Fax: ++1 (650) 859-2743
torsten at synapse.sri.com     http://www.stanford.edu/~rohlfing/

"Though this be madness, yet there is a method in't"

From stevenx.a.duchene at intel.com Tue Oct 25 18:10:13 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 25 Oct 2011 17:10:13 -0700
Subject: [torqueusers] torque not listening to ppn request specs
Message-ID:

Hello all:

I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small clusters and I am submitting the following MPI job using:

qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs

This script is very simple, as it only has one line in it to invoke the call to mpirun:

mpirun --machinefile $PBS_NODEFILE /home/myuser/mpi_test/mpi_hello_hostname

The actual source to this is also very simple:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int *buf, i, rank, nints, len;
  char hostname[256];

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  gethostname(hostname,255);
  printf("Hello world! I am process number: %d on host %s\n", rank, hostname);
  MPI_Finalize();
  return 0;
}

When I run this with the ppn=1 specification I would expect one processor per node spread over twelve nodes, but when I look at my output file I see it is running multiple processes per node instead. As a result I do not see the output from twelve unique nodes as I would expect.

My nodes file has the following sorts of entries:

enode01 np=4 mynode
enode02 np=4 mynode
enode03 np=4 mynode
enode04 np=4 mynode
enode05 np=4 mynode
enode06 np=4 mynode
enode07 np=4 mynode
enode08 np=4 mynode
enode09 np=4 mynode
enode10 np=4 mynode
enode11 np=4 mynode
enode12 np=4 mynode

I know I can remove the np=4 from each node specification and get the one process per node, but I was under the impression that I could use the ppn=1 or whatever to get the same thing.

Am I misunderstanding or overlooking something?

--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111025/ad2883d3/attachment.html

From knielson at adaptivecomputing.com Wed Oct 26 11:07:28 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 26 Oct 2011 11:07:28 -0600 (MDT)
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
Message-ID:

----- Original Message -----
> From: "StevenX A DuChene"
> To: torqueusers at supercluster.org
> Sent: Tuesday, October 25, 2011 6:10:13 PM
> Subject: [torqueusers] torque not listening to ppn request specs
>
> Hello all:
>
> I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small
> clusters and I am submitting the following mpi job using:
>
> qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs
>
> this script is very simple as it only has one line in it to invoke
> the call to mpirun
>
> mpirun --machinefile $PBS_NODEFILE
> /home/myuser/mpi_test/mpi_hello_hostname
>
> The actual source to this is also very simple:
>
#include > > #include > > > > int main(int argc, char **argv) > > { > > int *buf, i, rank, nints, len; > > char hostname[256]; > > > > MPI_Init(&argc,&argv); > > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > gethostname(hostname,255); > > printf("Hello world! I am process number: %d on host %s\n", rank, > hostname); > > MPI_Finalize(); > > return 0; > > } > > > > When I run this with the ppn=1 specification I would expect one > processer per node spread over twelve nodes but when I look at my > output file I see it is running multiple processes per node instead. > So as a result I do not see the output from twelve unique nodes as I > would expect. > > > > My nodes file has the following sorts of entries: > > > > enode01 np=4 mynode > > enode02 np=4 mynode > > enode03 np=4 mynode > > enode04 np=4 mynode > > enode05 np=4 mynode > > enode06 np=4 mynode > > enode07 np=4 mynode > > enode08 np=4 mynode > > enode09 np=4 mynode > > enode10 np=4 mynode > > enode11 np=4 mynode > > enode12 np=4 mynode > > > > I know I can remove the np=4 from each node specification and get the > one process per node but I was under the impression that I could use > the ppn=1 or whatever to get the same thing. > > > > Am I misunderstanding or overlooking something? > > -- > Steven, Try qsub -l nodes=12:ppn=1:mynode script_noarch.pbs Ken From j.kasiak at gmail.com Wed Oct 26 14:53:23 2011 From: j.kasiak at gmail.com (Jan Kasiak) Date: Wed, 26 Oct 2011 16:53:23 -0400 Subject: [torqueusers] Queue Node Type and ppn In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au> Message-ID: I have run into an issue with trying to implement a submission filter: The filter only catches the request if its: qsub myjob.sh And it does not catch the following: qsub -I -lnodes=2:ppn=2 Do I have to write wrapper to qsub as well as a submission filter? 
Thanks, -Jan On Mon, Oct 24, 2011 at 5:23 PM, Jan Kasiak wrote: > Gareth, > > I had started to write my own wrapper to the qsub binary. I didnt know that > torque had submit filter support. I will proceed with this option and see > how it goes. > > Many Thanks, > -Jan > > On Oct 24, 2011 4:32 PM, wrote: >> >> > -----Original Message----- >> > From: Jan Kasiak [mailto:j.kasiak at gmail.com] >> > Sent: Sunday, 23 October 2011 10:10 AM >> > To: torqueusers at supercluster.org >> > Subject: [torqueusers] Queue Node Type and ppn >> > >> > Hello Everyone, >> > >> > I'm using torque-3.0.0 and maui-3.3_pbs >> > I have searched far and wide for a solution to this problem..and I >> > can't find out how to set this up. >> > I have 3 node types: (13) * p5300, (16) * p5400 and (39) * towel (set >> > in my nodes file): >> > >> > towel01 np=12 towel >> > ... >> > p530001 np=8 p5300 >> > ... >> > p540001 np=8 p5400 >> > ... >> > >> > I want to prevent users from mixing node types for their jobs. >> > >> > I want to set up 3 queues, one for each node type, in such a way that >> > if you submit to queue p5300, it will error for the following: >> > qsub -I -q p5300 -lnodes=14:ppn=8 (more than 13 nodes) >> > qsub -I -q p5300 -lnodes=13:ppn=9 (ppn greater than available for >> > p5300) >> > qsub -I -q p5300 -lnodes=2:ppn=9 (ppn greater than available for p5300) >> > >> > Is this possible to set up? >> > >> > Thanks, >> > -Jan Kasiak >> >> Hi Jan, >> >> I'll stick my neck out and state that the only way to reject jobs based on >> the nodes and ppn as you would like is to use a submit filter. >> >> The rest can be done with reservations for each type of nodes and >> queues/classes that can access those reservations, or maybe just with types >> and defaults on queues. 
>>
>> Gareth
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From lloyd_brown at byu.edu Wed Oct 26 14:58:21 2011
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Wed, 26 Oct 2011 14:58:21 -0600
Subject: [torqueusers] Queue Node Type and ppn
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au>
Message-ID: <4EA8746D.8070107@byu.edu>

Are you parsing both STDIN and the command-line args in your filter?

From the torque docs (http://www.adaptivecomputing.com/resources/docs/torque/a.jqsubwrapper.php):

> When a submit filter exists, TORQUE will send the command file (or contents of STDIN if piped to qsub) to that script/executable ...

> Command line arguments passed to qsub are passed as arguments to the submit filter (filter won't see them in STDIN) in the same order and may be used as needed.

Essentially, the contents of the job script are passed via STDIN, but qsub command-line args are passed as command-line args to the filter.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 10/26/2011 02:53 PM, Jan Kasiak wrote:
> I have run into an issue with trying to implement a submission filter:
>
> The filter only catches the request if its:
> qsub myjob.sh
>
> And it does not catch the following:
> qsub -I -lnodes=2:ppn=2
>
> Do I have to write wrapper to qsub as well as a submission filter?
>
> Thanks,
> -Jan
>

From sm4082 at nyu.edu Wed Oct 26 15:04:47 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 26 Oct 2011 17:04:47 -0400
Subject: [torqueusers] Queue Node Type and ppn
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au>
Message-ID:

Hello,

Yes. It only catches the qsub script. What I did was to make it check all the arguments as well as the file.
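
For illustration only (this is not Sreedhar's wrapper): a minimal bash sketch of a filter that checks both places, assuming the p5300 limits mentioned earlier in this thread (13 nodes, ppn up to 8). The names MAX_NODES, MAX_PPN, check_spec, check_args, check_script and main are made up for this example and are not part of TORQUE. The only TORQUE behaviour it relies on is what the documentation quoted above describes: the job script arrives on STDIN and the qsub options arrive as arguments. It looks only at the first numeric nodes=N and ppn=M in each resource string and ignores host lists and every other resource.

```shell
#!/bin/bash
# Sketch of a submit filter for one node type. The limits are assumptions
# taken from the p5300 example in this thread (13 nodes, 8 cores each).
MAX_NODES=13
MAX_PPN=8

# check_spec SPEC: return 0 if a resource string such as
# "nodes=2:ppn=8,walltime=1:00:00" stays within the limits, 1 otherwise.
# Only the first numeric nodes= and ppn= values are inspected.
check_spec() {
    local spec=$1
    local re_nodes='nodes=([0-9]+)'
    local re_ppn='ppn=([0-9]+)'
    if [[ $spec =~ $re_nodes ]]; then
        (( 10#${BASH_REMATCH[1]} <= MAX_NODES )) || return 1
    fi
    if [[ $spec =~ $re_ppn ]]; then
        (( 10#${BASH_REMATCH[1]} <= MAX_PPN )) || return 1
    fi
    return 0
}

# check_args ARGS...: inspect the qsub options, which the filter receives
# as its own arguments. Handles both "-l spec" and "-lspec".
check_args() {
    while (( $# > 0 )); do
        case $1 in
            -l)  shift; check_spec "${1:-}" || return 1 ;;
            -l*) check_spec "${1#-l}" || return 1 ;;
        esac
        shift
    done
    return 0
}

# check_script: copy the job script from stdin to stdout unchanged and
# check every "#PBS -l ..." directive on the way through.
check_script() {
    local line bad=0
    local re='^#PBS[[:space:]]+-l[[:space:]]*(.*)$'
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "$line"
        if [[ $line =~ $re ]]; then
            check_spec "${BASH_REMATCH[1]}" || bad=1
        fi
    done
    return $bad
}

# main ARGS...: reject the job if either the options or the script ask
# for too much. 255 is the status that "exit -1" yields in bash.
main() {
    if ! check_args "$@" || ! check_script; then
        echo "rejected: limit is $MAX_NODES nodes, ppn=$MAX_PPN" >&2
        exit 255
    fi
}

# main "$@"   # enable this line when the file is installed as the filter
```

With the last line enabled, the file would be installed wherever the site's qsub looks for its filter (as far as I know this is set with a SUBMITFILTER entry in torque.cfg, but check the documentation for your version). Whether the filter is run at all for interactive (-I) submissions appears to depend on the TORQUE release, so that case is worth testing directly.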
There are tons of options for qsub, and most of the time we don't have to worry about checking all of them. Overall it got quite complicated with what I wanted the wrapper to do. If you want, I can send you mine just to build on it. Let me know.

Sreedhar.

--
Sent from my phone. Please excuse my brevity and any typos.

On Oct 26, 2011, at 16:53, Jan Kasiak wrote:

> I have run into an issue with trying to implement a submission filter:
>
> The filter only catches the request if its:
> qsub myjob.sh
>
> And it does not catch the following:
> qsub -I -lnodes=2:ppn=2
>
> Do I have to write wrapper to qsub as well as a submission filter?
>
> Thanks,
> -Jan
>
> On Mon, Oct 24, 2011 at 5:23 PM, Jan Kasiak wrote:
>> Gareth,
>>
>> I had started to write my own wrapper to the qsub binary. I didn't know that
>> torque had submit filter support. I will proceed with this option and see
>> how it goes.
>>
>> Many Thanks,
>> -Jan
>>
>> On Oct 24, 2011 4:32 PM, wrote:
>>>
>>>> -----Original Message-----
>>>> From: Jan Kasiak [mailto:j.kasiak at gmail.com]
>>>> Sent: Sunday, 23 October 2011 10:10 AM
>>>> To: torqueusers at supercluster.org
>>>> Subject: [torqueusers] Queue Node Type and ppn
>>>>
>>>> Hello Everyone,
>>>>
>>>> I'm using torque-3.0.0 and maui-3.3_pbs
>>>> I have searched far and wide for a solution to this problem..and I
>>>> can't find out how to set this up.
>>>> I have 3 node types: (13) * p5300, (16) * p5400 and (39) * towel (set
>>>> in my nodes file):
>>>>
>>>> towel01 np=12 towel
>>>> ...
>>>> p530001 np=8 p5300
>>>> ...
>>>> p540001 np=8 p5400
>>>> ...
>>>>
>>>> I want to prevent users from mixing node types for their jobs.
>>>> >>>> I want to set up 3 queues, one for each node type, in such a way that >>>> if you submit to queue p5300, it will error for the following: >>>> qsub -I -q p5300 -lnodes=14:ppn=8 (more than 13 nodes) >>>> qsub -I -q p5300 -lnodes=13:ppn=9 (ppn greater than available for >>>> p5300) >>>> qsub -I -q p5300 -lnodes=2:ppn=9 (ppn greater than available for p5300) >>>> >>>> Is this possible to set up? >>>> >>>> Thanks, >>>> -Jan Kasiak >>> >>> Hi Jan, >>> >>> I'll stick my neck out and state that the only way to reject jobs based on >>> the nodes and ppn as you would like is to use a submit filter. >>> >>> The rest can be done with reservations for each type of nodes and >>> queues/classes that can access those reservations, or maybe just with types >>> and defaults on queues. >>> >>> Gareth >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From j.kasiak at gmail.com Wed Oct 26 15:16:35 2011 From: j.kasiak at gmail.com (Jan Kasiak) Date: Wed, 26 Oct 2011 17:16:35 -0400 Subject: [torqueusers] Queue Node Type and ppn In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au> Message-ID: I would greatly appreciate it if you could send it to me. Thanks, -Jan On Oct 26, 2011 5:05 PM, "Sreedhar Manchu" wrote: > Hello, > > Yes. It catches only qsub script. What I did was to make it check all the > arguments as well as file. There are tons of options for qsub and most of > the time we don't have to about ckecking all of them. Overall it got quite > complicated with what I wanted wrapper to do. If you want I can send you > mine just to build on it. Let me know. > > Sreedhar. > > -- > Sent from my phone. 
Please excuse my brevity and any typos. > > On Oct 26, 2011, at 16:53, Jan Kasiak wrote: > > > I have run into an issue with trying to implement a submission filter: > > > > The filter only catches the request if its: > > qsub myjob.sh > > > > And it does not catch the following: > > qsub -I -lnodes=2:ppn=2 > > > > Do I have to write wrapper to qsub as well as a submission filter? > > > > Thanks, > > -Jan > > > > On Mon, Oct 24, 2011 at 5:23 PM, Jan Kasiak wrote: > >> Gareth, > >> > >> I had started to write my own wrapper to the qsub binary. I didnt know > that > >> torque had submit filter support. I will proceed with this option and > see > >> how it goes. > >> > >> Many Thanks, > >> -Jan > >> > >> On Oct 24, 2011 4:32 PM, wrote: > >>> > >>>> -----Original Message----- > >>>> From: Jan Kasiak [mailto:j.kasiak at gmail.com] > >>>> Sent: Sunday, 23 October 2011 10:10 AM > >>>> To: torqueusers at supercluster.org > >>>> Subject: [torqueusers] Queue Node Type and ppn > >>>> > >>>> Hello Everyone, > >>>> > >>>> I'm using torque-3.0.0 and maui-3.3_pbs > >>>> I have searched far and wide for a solution to this problem..and I > >>>> can't find out how to set this up. > >>>> I have 3 node types: (13) * p5300, (16) * p5400 and (39) * towel (set > >>>> in my nodes file): > >>>> > >>>> towel01 np=12 towel > >>>> ... > >>>> p530001 np=8 p5300 > >>>> ... > >>>> p540001 np=8 p5400 > >>>> ... > >>>> > >>>> I want to prevent users from mixing node types for their jobs. > >>>> > >>>> I want to set up 3 queues, one for each node type, in such a way that > >>>> if you submit to queue p5300, it will error for the following: > >>>> qsub -I -q p5300 -lnodes=14:ppn=8 (more than 13 nodes) > >>>> qsub -I -q p5300 -lnodes=13:ppn=9 (ppn greater than available for > >>>> p5300) > >>>> qsub -I -q p5300 -lnodes=2:ppn=9 (ppn greater than available for > p5300) > >>>> > >>>> Is this possible to set up? 
> >>>> > >>>> Thanks, > >>>> -Jan Kasiak > >>> > >>> Hi Jan, > >>> > >>> I'll stick my neck out and state that the only way to reject jobs based > on > >>> the nodes and ppn as you would like is to use a submit filter. > >>> > >>> The rest can be done with reservations for each type of nodes and > >>> queues/classes that can access those reservations, or maybe just with > types > >>> and defaults on queues. > >>> > >>> Gareth > >>> _______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111026/2436e217/attachment.html From j.kasiak at gmail.com Wed Oct 26 15:31:16 2011 From: j.kasiak at gmail.com (Jan Kasiak) Date: Wed, 26 Oct 2011 17:31:16 -0400 Subject: [torqueusers] Queue Node Type and ppn In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620102BBE547C5@exvic-mbx04.nexus.csiro.au> Message-ID: I tried a simple bash script with 'exit -1' and it didn't work on an older version of torque (2.2.1), but it does work on the machine I want with newer torque (3.0.0.). I think I'll be good from here :-) -Jan On Wed, Oct 26, 2011 at 5:16 PM, Jan Kasiak wrote: > I would greatly appreciate it if you could send it to me. > > Thanks, > -Jan > > On Oct 26, 2011 5:05 PM, "Sreedhar Manchu" wrote: >> >> Hello, >> >> Yes. It catches only qsub script. What I did was to make it check all the >> arguments as well as file. 
There are tons of options for qsub and most of
>> the time we don't have to worry about checking all of them. Overall it got quite
>> complicated with what I wanted wrapper to do. If you want I can send you
>> mine just to build on it. Let me know.
>>
>> Sreedhar.
>>
>> --
>> Sent from my phone. Please excuse my brevity and any typos.
>>
>> On Oct 26, 2011, at 16:53, Jan Kasiak wrote:
>>
>> > I have run into an issue with trying to implement a submission filter:
>> >
>> > The filter only catches the request if its:
>> > qsub myjob.sh
>> >
>> > And it does not catch the following:
>> > qsub -I -lnodes=2:ppn=2
>> >
>> > Do I have to write wrapper to qsub as well as a submission filter?
>> >
>> > Thanks,
>> > -Jan
>> >
>> > On Mon, Oct 24, 2011 at 5:23 PM, Jan Kasiak wrote:
>> >> Gareth,
>> >>
>> >> I had started to write my own wrapper to the qsub binary. I didnt know
>> >> that
>> >> torque had submit filter support. I will proceed with this option and
>> >> see
>> >> how it goes.
>> >>
>> >> Many Thanks,
>> >> -Jan
>> >>
>> >> On Oct 24, 2011 4:32 PM, wrote:
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: Jan Kasiak [mailto:j.kasiak at gmail.com]
>> >>>> Sent: Sunday, 23 October 2011 10:10 AM
>> >>>> To: torqueusers at supercluster.org
>> >>>> Subject: [torqueusers] Queue Node Type and ppn
>> >>>>
>> >>>> Hello Everyone,
>> >>>>
>> >>>> I'm using torque-3.0.0 and maui-3.3_pbs
>> >>>> I have searched far and wide for a solution to this problem..and I
>> >>>> can't find out how to set this up.
>> >>>> I have 3 node types: (13) * p5300, (16) * p5400 and (39) * towel (set
>> >>>> in my nodes file):
>> >>>>
>> >>>> towel01 np=12 towel
>> >>>> ...
>> >>>> p530001 np=8 p5300
>> >>>> ...
>> >>>> p540001 np=8 p5400
>> >>>> ...
>> >>>>
>> >>>> I want to prevent users from mixing node types for their jobs.
>> >>>> >> >>>> I want to set up 3 queues, one for each node type, in such a way that >> >>>> if you submit to queue p5300, it will error for the following: >> >>>> qsub -I -q p5300 -lnodes=14:ppn=8 (more than 13 nodes) >> >>>> qsub -I -q p5300 -lnodes=13:ppn=9 (ppn greater than available for >> >>>> p5300) >> >>>> qsub -I -q p5300 -lnodes=2:ppn=9 (ppn greater than available for >> >>>> p5300) >> >>>> >> >>>> Is this possible to set up? >> >>>> >> >>>> Thanks, >> >>>> -Jan Kasiak >> >>> >> >>> Hi Jan, >> >>> >> >>> I'll stick my neck out and state that the only way to reject jobs >> >>> based on >> >>> the nodes and ppn as you would like is to use a submit filter. >> >>> >> >>> The rest can be done with reservations for each type of nodes and >> >>> queues/classes that can access those reservations, or maybe just with >> >>> types >> >>> and defaults on queues. >> >>> >> >>> Gareth >> >>> _______________________________________________ >> >>> torqueusers mailing list >> >>> torqueusers at supercluster.org >> >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> > _______________________________________________ >> > torqueusers mailing list >> > torqueusers at supercluster.org >> > http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > From k.mohammad1 at physics.ox.ac.uk Thu Oct 27 05:20:33 2011 From: k.mohammad1 at physics.ox.ac.uk (Kashif Mohammad) Date: Thu, 27 Oct 2011 11:20:33 +0000 Subject: [torqueusers] Stale connection error in log file of torque server Message-ID: <88B17E26E0A9F94381C67535AEF2BB5E2A685B28@EXCHNG13.physics.ox.ac.uk> Hi We are using torque and maui for glite based grid cluster. Torque and maui are installed on a separate virtual machine with 4GB RAM and two cpu and almost no grid software on it. 
We are using version torque-2.3.13-1.el5 and maui-3.2.6p21-snap.1234905291.5.el5. We are seeing some stability issues with torque and maui without any apparent reason. Maui keeps hanging intermittently and coming back without any intervention.

showq
ERROR: lost connection to server
ERROR: cannot request service (status)

It seems that it is not able to contact the torque server because the torque server is busy. I can see some entries in the log file like:

Oct 27 07:51:43 t2torque02 pbs_server: LOG_ERROR::wait_request, connection 25 to host 0 has timed out after 900 seconds - closing stale connection
Oct 27 08:35:13 t2torque02 pbs_server: LOG_ERROR::wait_request, connection 14 to host 2734753079 has timed out after 900 seconds - closing stale connection

where 2734753079 is the IP address of the torque server itself. Load on the torque server is not high at all and it also has enough free RAM.

Any suggestions, please?

Thanks
Kashif

From lance at quantumbioinc.com Thu Oct 27 08:57:00 2011
From: lance at quantumbioinc.com (Lance Westerhoff)
Date: Thu, 27 Oct 2011 10:57:00 -0400
Subject: [torqueusers] Using procs and grouping by even numbers of processors/node
Message-ID: <6148E9DE-B6F4-4CCA-A41B-92539888CBAA@quantumbioinc.com>

Hello All-

Another question has come up in the use of the procs qsub argument. When you run GAMESS, apparently you need to group the processors evenly among nodes. In other words, if you ask for procs=4 and 3 processors are allocated on one node and 1 allocated on the other, the job will fail since GAMESS requires that an even number of processes be available on the nodes on which it runs. In this case, the job would have been successful if 2 processes were started on one node and 2 processes were started on the other.

Is there any way to modify or augment the procs option in order to make sure that an even number of processes are started on each node?

Thanks!

-Lance

____________________
Lance M. Westerhoff, Ph.D.
General Manager
QuantumBio Inc.
WWW: http://www.quantumbioinc.com Email: lance at quantumbioinc.com Phone: 814-235-6908 Fax: 814-235-6909 This message and any attachments are solely for the intended recipient and should be considered confidential. If you are not the intended recipient, please immediately and permanently delete. From stevenx.a.duchene at intel.com Thu Oct 27 09:32:18 2011 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Thu, 27 Oct 2011 08:32:18 -0700 Subject: [torqueusers] torque not listening to ppn request specs In-Reply-To: References: Message-ID: Ken: I tried that and my output file still shows that there are only 64 unique hosts being used four times each instead of 256 hosts used 1 time each. So as I said I am not getting the results out of the ppn=1 directive that I am expecting. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Wednesday, October 26, 2011 10:07 AM To: Torque Users Mailing List Subject: Re: [torqueusers] torque not listening to ppn request specs ----- Original Message ----- > From: "StevenX A DuChene" > To: torqueusers at supercluster.org > Sent: Tuesday, October 25, 2011 6:10:13 PM > Subject: [torqueusers] torque not listening to ppn request specs > > > > > > Hello all: > > I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small > clusters and I am submitting the following mpi job using: > > > > qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs > > > > this script is very simple as it only has one line in it to invoke > the call to mpirun > > > > mpirun --machinefile $PBS_NODEFILE > /home/myuser/mpi_test/mpi_hello_hostname > > > > The actual source to this is also very simple: > > > > #include > > #include > > > > int main(int argc, char **argv) > > { > > int *buf, i, rank, nints, len; > > char hostname[256]; > > > > MPI_Init(&argc,&argv); > > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > gethostname(hostname,255); 
> > printf("Hello world! I am process number: %d on host %s\n", rank, > hostname); > > MPI_Finalize(); > > return 0; > > } > > > > When I run this with the ppn=1 specification I would expect one > processer per node spread over twelve nodes but when I look at my > output file I see it is running multiple processes per node instead. > So as a result I do not see the output from twelve unique nodes as I > would expect. > > > > My nodes file has the following sorts of entries: > > > > enode01 np=4 mynode > > enode02 np=4 mynode > > enode03 np=4 mynode > > enode04 np=4 mynode > > enode05 np=4 mynode > > enode06 np=4 mynode > > enode07 np=4 mynode > > enode08 np=4 mynode > > enode09 np=4 mynode > > enode10 np=4 mynode > > enode11 np=4 mynode > > enode12 np=4 mynode > > > > I know I can remove the np=4 from each node specification and get the > one process per node but I was under the impression that I could use > the ppn=1 or whatever to get the same thing. > > > > Am I misunderstanding or overlooking something? > > -- > Steven, Try qsub -l nodes=12:ppn=1:mynode script_noarch.pbs Ken _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From stevenx.a.duchene at intel.com Thu Oct 27 09:48:02 2011 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Thu, 27 Oct 2011 08:48:02 -0700 Subject: [torqueusers] torque not listening to ppn request specs References: Message-ID: Is it possible that there is some maui setting that could have an effect on packing processes on nodes (one per processor) rather than spreading them out across nodes (one per node)? Some "optimization" thing I need to turn off or on? 
-- Steven DuChene -----Original Message----- From: DuChene, StevenX A Sent: Thursday, October 27, 2011 8:32 AM To: Torque Users Mailing List Subject: RE: [torqueusers] torque not listening to ppn request specs Ken: I tried that and my output file still shows that there are only 64 unique hosts being used four times each instead of 256 hosts used 1 time each. So as I said I am not getting the results out of the ppn=1 directive that I am expecting. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Wednesday, October 26, 2011 10:07 AM To: Torque Users Mailing List Subject: Re: [torqueusers] torque not listening to ppn request specs ----- Original Message ----- > From: "StevenX A DuChene" > To: torqueusers at supercluster.org > Sent: Tuesday, October 25, 2011 6:10:13 PM > Subject: [torqueusers] torque not listening to ppn request specs > > > > > > Hello all: > > I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small > clusters and I am submitting the following mpi job using: > > > > qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs > > > > this script is very simple as it only has one line in it to invoke > the call to mpirun > > > > mpirun --machinefile $PBS_NODEFILE > /home/myuser/mpi_test/mpi_hello_hostname > > > > The actual source to this is also very simple: > > > > #include > > #include > > > > int main(int argc, char **argv) > > { > > int *buf, i, rank, nints, len; > > char hostname[256]; > > > > MPI_Init(&argc,&argv); > > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > gethostname(hostname,255); > > printf("Hello world! I am process number: %d on host %s\n", rank, > hostname); > > MPI_Finalize(); > > return 0; > > } > > > > When I run this with the ppn=1 specification I would expect one > processer per node spread over twelve nodes but when I look at my > output file I see it is running multiple processes per node instead. 
> So as a result I do not see the output from twelve unique nodes as I > would expect. > > > > My nodes file has the following sorts of entries: > > > > enode01 np=4 mynode > > enode02 np=4 mynode > > enode03 np=4 mynode > > enode04 np=4 mynode > > enode05 np=4 mynode > > enode06 np=4 mynode > > enode07 np=4 mynode > > enode08 np=4 mynode > > enode09 np=4 mynode > > enode10 np=4 mynode > > enode11 np=4 mynode > > enode12 np=4 mynode > > > > I know I can remove the np=4 from each node specification and get the > one process per node but I was under the impression that I could use > the ppn=1 or whatever to get the same thing. > > > > Am I misunderstanding or overlooking something? > > -- > Steven, Try qsub -l nodes=12:ppn=1:mynode script_noarch.pbs Ken _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From stevenx.a.duchene at intel.com Thu Oct 27 10:00:59 2011 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Thu, 27 Oct 2011 09:00:59 -0700 Subject: [torqueusers] torque not listening to ppn request specs In-Reply-To: <4EA97E22.6060603@le.ac.uk> References: <4EA97E22.6060603@le.ac.uk> Message-ID: Nope; I just tried that too and got the same result. Packed processes one per processor instead of one per node. If I omit the ppn=1 option and just use tpn=1 I get the following: $ qsub -l nodes=256:SeaMicro:tpn=1 script_noarch.pbs qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes I have 256 nodes so then I tried reducing my request down to 250 but same thing $ qsub -l nodes=250:SeaMicro:tpn=1 script_noarch.pbs qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes so then I tried going down to only 25: $ qsub -l nodes=25:SeaMicro:tpn=1 script_noarch.pbs qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes WTF? 
--
Steven DuChene

-----Original Message-----
From: Jon Wakelin [mailto:jw292 at leicester.ac.uk]
Sent: Thursday, October 27, 2011 8:52 AM
To: DuChene, StevenX A
Subject: Re: [torqueusers] torque not listening to ppn request specs

On 26/10/11 01:10, DuChene, StevenX A wrote:
> Hello all:
>
> I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small clusters
> and I am submitting the following mpi job using:
>
> qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs

I think you want tpn in addition to ppn. Where tpn = tasks per node. Note the comma not semicolon.

qsub -l nodes=12:ppn=1,tpn=1 script_noarch.pbs

Does that give you what you want?

It's best to think of nodes=12:ppn=1 as requesting 12 lots of 1 core, a request that could be satisfied by 12 single core machines or a single 12-core machine, or 2 six-core machines, and so on. It doesn't specifically request 12 physically separate nodes (servers).

At least, that's what we use here.

regards
Jon

--
Dr Jon Wakelin
Systems Architect
IT Services, University Of Leicester, LE1 7RH
Tel. +44 (0)116 252 5147

From tbaer at utk.edu Thu Oct 27 10:04:40 2011
From: tbaer at utk.edu (Troy Baer)
Date: Thu, 27 Oct 2011 12:04:40 -0400
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References:
Message-ID: <1319731480.2548.216.camel@browncoat.jics.utk.edu>

On Thu, 2011-10-27 at 08:48 -0700, DuChene, StevenX A wrote:
> Is it possible that there is some maui setting that could have an
> effect on packing processes on nodes (one per processor) rather than
> spreading them out across nodes (one per node)? Some "optimization"
> thing I need to turn off or on?

Yes -- take a look at how you have JOBNODEMATCHPOLICY set.
--Troy
--
Troy Baer, HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

From stevenx.a.duchene at intel.com Thu Oct 27 10:12:35 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Thu, 27 Oct 2011 09:12:35 -0700
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To: <4EA981F1.8000605@le.ac.uk>
References: <4EA97E22.6060603@le.ac.uk> <4EA981F1.8000605@le.ac.uk>
Message-ID:

Jon:

Thanks for the suggestions...

OK, I did:

$ qsub -l nodes=256:SeaMicro:ppn=1 -l tpn=1 script_noarch.pbs
55.emcutil1.end.atom

So it took the job that time but I still got packed processes instead of one per node. Only 64 unique nodes used with four processes per node instead of 256 unique nodes.

GAH!
--
Steven DuChene

-----Original Message-----
From: Jon Wakelin [mailto:jw292 at leicester.ac.uk]
Sent: Thursday, October 27, 2011 9:08 AM
To: DuChene, StevenX A
Cc: Jon Wakelin; Torque Users Mailing List
Subject: Re: [torqueusers] torque not listening to ppn request specs

On 27/10/11 17:00, DuChene, StevenX A wrote:
> Nope; I just tried that too and got the same result. Packed processes one per processor instead of one per node. If I omit the ppn=1 option and just use tpn=1 I get the following:
>
> $ qsub -l nodes=256:SeaMicro:tpn=1 script_noarch.pbs
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes

No, the tpn is a separate resource request. So you need to separate it with a comma not a semicolon

qsub -l nodes=256,tpn=1

or alternatively specify -l twice

qsub -l nodes=256 -l tpn=1

Give it another go. We see the same behaviour as you (packed as tightly as possible) if we only specify nodes=x:ppn=y.

Does that help?
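
As a rough illustration of the packing behaviour described above, the following Python sketch places single-core tasks on nodes that each offer a fixed number of slots, either filling one node before moving to the next or going round-robin across nodes. It is only a toy model and not how Torque or Maui actually place jobs; the function name, the `pack`/`spread` labels and the host names are made up for this example, while the 256-node, np=4 figures come from the thread.

```python
from collections import Counter


def place_tasks(num_tasks, nodes, slots_per_node, policy="pack"):
    """Toy placement of single-core tasks; returns one host name per task.

    Illustrative model only, not the scheduler's real algorithm.
    """
    if num_tasks > len(nodes) * slots_per_node:
        raise ValueError("not enough slots for the requested tasks")
    if policy == "pack":
        # Fill every slot on a node before moving to the next node.
        return [nodes[i // slots_per_node] for i in range(num_tasks)]
    if policy == "spread":
        # Round-robin over nodes, so each node gets one task per pass.
        return [nodes[i % len(nodes)] for i in range(num_tasks)]
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical host names; 256 nodes with 4 slots each, as in the thread.
nodes = [f"eatom{i:03d}" for i in range(256)]

packed = place_tasks(256, nodes, 4, policy="pack")
spread = place_tasks(256, nodes, 4, policy="spread")

# Number of distinct hosts used and the largest task count on any one host.
print(len(set(packed)), max(Counter(packed).values()))
print(len(set(spread)), max(Counter(spread).values()))
```

Under this model, packing 256 tasks onto 4-slot nodes touches only 64 hosts, which matches the symptom reported in the thread, while spreading touches all 256.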
Jon

>
> -----Original Message-----
> From: Jon Wakelin [mailto:jw292 at leicester.ac.uk]
> Sent: Thursday, October 27, 2011 8:52 AM
> To: DuChene, StevenX A
> Subject: Re: [torqueusers] torque not listening to ppn request specs
>
> On 26/10/11 01:10, DuChene, StevenX A wrote:
>> Hello all:
>>
>> I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small clusters
>> and I am submitting the following mpi job using:
>>
>> qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs
>
> I think you want tpn in addition to ppn. Where tpn = tasks per node. Note the comma not semicolon.
>
> qsub -l nodes=12:ppn=1,tpn=1 script_noarch.pbs
>
> Does that give you what you want?
>
> It's best to think of nodes=12:ppn=1 as requesting 12 lots of 1 core, a request that could be satisfied by 12 single
> core machines or a single 12-core machine, or 2 six-core machines, and so on. It doesn't specifically request 12
> physically separate nodes (servers).
>
> At least, that's what we use here.
>
> regards
> Jon

--
Dr Jon Wakelin
Systems Architect
IT Services, University Of Leicester, LE1 7RH
Tel. +44 (0)116 252 5147

From jdsmit at sandia.gov Thu Oct 27 10:18:13 2011
From: jdsmit at sandia.gov (Smith, Jerry Don II)
Date: Thu, 27 Oct 2011 16:18:13 +0000
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
Message-ID:

Looking at your accounting logs, are you getting 256 nodes total, and only executing on the 64? Or are you getting just 64 nodes allocated?

If getting 256 but only running on 64, you may need to add "-np 1" to your mpirun line

Ie: mpirun -np 1 /path/to/executable

Jerry

On 10/27/11 10:12 AM, "DuChene, StevenX A" wrote:

>Jon:
>
>Thanks for the suggestions...
>
>OK, I did:
>
>$ qsub -l nodes=256:SeaMicro:ppn=1 -l tpn=1 script_noarch.pbs
>55.emcutil1.end.atom
>
>So it took the job that time but I still got packed processes instead of
>one per node. Only 64 unique nodes used with four processes per node
>instead of 256 unique nodes.
>
>GAH!
>-- >Steven DuChene > > > >-----Original Message----- >From: Jon Wakelin [mailto:jw292 at leicester.ac.uk] >Sent: Thursday, October 27, 2011 9:08 AM >To: DuChene, StevenX A >Cc: Jon Wakelin; Torque Users Mailing List >Subject: Re: [torqueusers] torque not listening to ppn request specs > >On 27/10/11 17:00, DuChene, StevenX A wrote: >> Nope; I just tried that too and got the same result. Packed processes >>one per processor instead of one per node. If I omit the ppn=1 option >>and just use tpn=1 I get the following: >> >> $ qsub -l nodes=256:SeaMicro:tpn=1 script_noarch.pbs >> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes > > >No, the tpn is a separate resource request. So you need to separate it >with a comma not a semicolon > > qsub -l nodes=256,tpn=1 > >or alternatively specify -l twice > > qsub -l nodes=256 -l tpn=1 > > >give it another go. We see the same behaviour as you (packed as tightly >as possible) if we only specify node=x:ppn=y . > >does that help? > >Jon > >> >> -----Original Message----- >> From: Jon Wakelin [mailto:jw292 at leicester.ac.uk] >> Sent: Thursday, October 27, 2011 8:52 AM >> To: DuChene, StevenX A >> Subject: Re: [torqueusers] torque not listening to ppn request specs >> >> On 26/10/11 01:10, DuChene, StevenX A wrote: >>> Hello all: >>> >>> I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small >>>clusters >>> and I am submitting the following mpi job using: >>> >>> qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs >> >> I think you want tpn in addition to ppn. Where tpn = tasks per node. >>Note the comma not semicolon. >> >> >> qsub -l nodes=12:ppn=1,tpn=1 script_noarch.pbs >> >> >> Does that give you what you want? >> >> It's best to think of nodes=12:ppn=1 as requesting 12 lots of 1 core, a >>request that could be satisfied by 12 single >> core machines or a single 12-core machine, or 2 six-core machines, and >>so on. It doesn't specifically request 12 >> physically separate nodes (servers). 
>> >> >> At least, that's what we use here. >> >> regards >> Jon > > >-- >Dr Jon Wakelin >Systems Architect >IT Services, University Of Leicester, LE1 7RH >Tel. +44 (0)116 252 5147 >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers > From stevenx.a.duchene at intel.com Thu Oct 27 10:18:23 2011 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Thu, 27 Oct 2011 09:18:23 -0700 Subject: [torqueusers] torque not listening to ppn request specs In-Reply-To: <1319731480.2548.216.camel@browncoat.jics.utk.edu> References: <1319731480.2548.216.camel@browncoat.jics.utk.edu> Message-ID: Hmm, under global policies I see: JOBNODEMATCHPOLICY[0] but under partition DEFAULT policies I see: JOBNODEMATCHPOLICY[1] Now I have to go dig around in the maui docs to figure out what these settings mean. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Troy Baer Sent: Thursday, October 27, 2011 9:05 AM To: Torque Users Mailing List Subject: Re: [torqueusers] torque not listening to ppn request specs On Thu, 2011-10-27 at 08:48 -0700, DuChene, StevenX A wrote: > Is it possible that there is some maui setting that could have an > effect on packing processes on nodes (one per processor) rather than > spreading them out across nodes (one per node)? Some "optimization" > thing I need to turn off or on? Yes -- take a look at how you have JOBNODEMATCHPOLICY set. 
--Troy
--
Troy Baer, HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From jjc at iastate.edu Thu Oct 27 10:18:51 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 27 Oct 2011 11:18:51 -0500
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References:
Message-ID:

Steve,

If this is a question just of design and not of use, ignore the following:

Getting what you want, 1 processor on N nodes.

Possibilities:

1) One possibility is to try:

   qmgr -c 'set server node_pack = False'

   (I think that the default setup is True, which is what I want and use, this keeps nodes more free.)
   I don't know if that will give you the behavior that you want, but it does try to launch jobs on separate nodes.

2) Use nodes=20:ppn=4 and use --bynode option if you are using OpenMPI (which is what I advise users here)
   or if you are using another implementation of MPI that does not support --bynode or something similar, issue

   uniq < ${PBS_NODEFILE} > Nodefile
   mpirun -np 20 -machinefile Nodefile ./application

   (I actually supply a script mpirun1, which does this along with mpirun2, mpirun3, that supply 2, 3, etc per node for two clusters that use vendor MPI's based upon MPICH.)

Best of luck,

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/ >-----Original Message----- >From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >bounces at supercluster.org] On Behalf Of DuChene, StevenX A >Sent: Thursday, October 27, 2011 10:48 AM >To: Torque Users Mailing List >Subject: Re: [torqueusers] torque not listening to ppn request specs > >Is it possible that there is some maui setting that could have an >effect on packing processes on nodes (one per processor) rather than >spreading them out across nodes (one per node)? Some "optimization" >thing I need to turn off or on? >-- >Steven DuChene > >-----Original Message----- >From: DuChene, StevenX A >Sent: Thursday, October 27, 2011 8:32 AM >To: Torque Users Mailing List >Subject: RE: [torqueusers] torque not listening to ppn request specs > >Ken: >I tried that and my output file still shows that there are only 64 >unique hosts being used four times each instead of 256 hosts used 1 >time each. So as I said I am not getting the results out of the >ppn=1 directive that I am expecting. 
>-- >Steven DuChene > >-----Original Message----- >From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >bounces at supercluster.org] On Behalf Of Ken Nielson >Sent: Wednesday, October 26, 2011 10:07 AM >To: Torque Users Mailing List >Subject: Re: [torqueusers] torque not listening to ppn request specs > > > >----- Original Message ----- >> From: "StevenX A DuChene" >> To: torqueusers at supercluster.org >> Sent: Tuesday, October 25, 2011 6:10:13 PM >> Subject: [torqueusers] torque not listening to ppn request specs >> >> >> >> >> >> Hello all: >> >> I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of >small >> clusters and I am submitting the following mpi job using: >> >> >> >> qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs >> >> >> >> this script is very simple as it only has one line in it to invoke >> the call to mpirun >> >> >> >> mpirun --machinefile $PBS_NODEFILE >> /home/myuser/mpi_test/mpi_hello_hostname >> >> >> >> The actual source to this is also very simple: >> >> >> >> #include >> >> #include >> >> >> >> int main(int argc, char **argv) >> >> { >> >> int *buf, i, rank, nints, len; >> >> char hostname[256]; >> >> >> >> MPI_Init(&argc,&argv); >> >> MPI_Comm_rank(MPI_COMM_WORLD, &rank); >> >> gethostname(hostname,255); >> >> printf("Hello world! I am process number: %d on host %s\n", rank, >> hostname); >> >> MPI_Finalize(); >> >> return 0; >> >> } >> >> >> >> When I run this with the ppn=1 specification I would expect one >> processer per node spread over twelve nodes but when I look at my >> output file I see it is running multiple processes per node >instead. >> So as a result I do not see the output from twelve unique nodes as >I >> would expect. 
>>
>> My nodes file has the following sorts of entries:
>>
>> enode01 np=4 mynode
>> enode02 np=4 mynode
>> enode03 np=4 mynode
>> enode04 np=4 mynode
>> enode05 np=4 mynode
>> enode06 np=4 mynode
>> enode07 np=4 mynode
>> enode08 np=4 mynode
>> enode09 np=4 mynode
>> enode10 np=4 mynode
>> enode11 np=4 mynode
>> enode12 np=4 mynode
>>
>> I know I can remove the np=4 from each node specification and get the
>> one process per node but I was under the impression that I could use
>> the ppn=1 or whatever to get the same thing.
>>
>> Am I misunderstanding or overlooking something?
>>
>> --
>>
>
>Steven,
>
>Try qsub -l nodes=12:ppn=1:mynode script_noarch.pbs
>
>Ken
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers

From stevenx.a.duchene at intel.com Thu Oct 27 10:53:33 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Thu, 27 Oct 2011 09:53:33 -0700
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References:
Message-ID:

Thanks to all who are reading and responding to my pleas for assistance or guidance.

We are a benchmarking center and I have a user who wants to start up his benchmark process across all 256 nodes, one process per node. Yes, right now I am using openmpi but later today I need to try all of this with the Intel MPI implementation.

I tried doing the following:

$(PBS_NODEFILE) > /home/myuser/mpi_test/cruddy256
mpirun --machinefile $PBS_NODEFILE /home/myuser/mpi_test/mpi_hello_hostname

so I could try examining the nodefile I am getting from torque but all I get is a zero length file.
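
One way to examine the node file from inside a job is to summarise it rather than copy it. The Python sketch below is a minimal example under the assumption that the file named by $PBS_NODEFILE holds one host name per allocated slot, so a host with four slots appears four times; the function names and the fallback sample data are invented for illustration.

```python
import os
from collections import Counter


def ranks_per_host(nodefile_lines):
    """Count how many slots each host has in a PBS-style node file."""
    hosts = [line.strip() for line in nodefile_lines if line.strip()]
    return Counter(hosts)


def unique_hosts(nodefile_lines):
    """Return each host once, in first-seen order (one entry per node)."""
    seen = {}
    for line in nodefile_lines:
        host = line.strip()
        if host:
            seen.setdefault(host, None)
    return list(seen)


if __name__ == "__main__":
    path = os.environ.get("PBS_NODEFILE")
    if path:
        with open(path) as fh:
            lines = fh.readlines()
    else:
        # Made-up sample so the sketch also runs outside a job.
        lines = ["enode01\n", "enode01\n", "enode02\n", "enode02\n"]
    counts = ranks_per_host(lines)
    print(f"{len(counts)} unique hosts, {sum(counts.values())} slots")
    for host in unique_hosts(lines):
        print(host, counts[host])
```

If the unique-host count is lower than the number of nodes requested, the allocation itself was packed, independent of anything the MPI launcher does afterwards.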
I looked in my torque accounting logs and I see things in the execution host list of: exec_host=eatom255/3+eatom255/2+eatom255/1+eatom255/0+eatom254/3+eatom254/2+eatom254/1+eatom254/0+eatom253/3+eatom253/2+eatom253/1+eatom253/0 I copied this exec_host= stuff to a separate file and did some text munging and I only see 64 unique hosts being allocated by torque. So does that mean torque is screwing me over or could it still be some optimization being done by maui that is running as the scheduler above the torque pbs_server process? -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Coyle, James J [ITACD] Sent: Thursday, October 27, 2011 9:19 AM To: Torque Users Mailing List Subject: Re: [torqueusers] torque not listening to ppn request specs Steve, If this is a question just of design and not of use, ignore the following: Getting what you want, 1 processor on N nodes. Possibilities: 1) One possibility is to try: qmgr -c 'set server node_pack = False' (I think that the default setup is True, which is what I want and use, this keeps nodes more free.) I don't know if that will give you the behavior that you want, but it does try to launch jobs on separate nodes. 2) Use nodes=20:ppn=4 and use --bynode option if you are using OpenMPI (which is what I advise users here) or if you are using another implementation of MPI that does not support --bynode or something similar, issue uniq < ${PBS_NODEFILE} > Nodefile mpirun -np 20 -machinefile Nodefile ./application (I actually supply a script mpirun1, which does this along with mpirun2, mpirun3, that supply 2, 3, etc per node for two clusters that use vendor MPI's based upon MPICH.) best of Luck, James Coyle, PhD High Performance Computing Group Iowa State Univ. 
web: http://jjc.public.iastate.edu/

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
>Sent: Thursday, October 27, 2011 10:48 AM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] torque not listening to ppn request specs
>
>Is it possible that there is some maui setting that could have an
>effect on packing processes on nodes (one per processor) rather than
>spreading them out across nodes (one per node)? Some "optimization"
>thing I need to turn off or on?
>--
>Steven DuChene
>
>-----Original Message-----
>From: DuChene, StevenX A
>Sent: Thursday, October 27, 2011 8:32 AM
>To: Torque Users Mailing List
>Subject: RE: [torqueusers] torque not listening to ppn request specs
>
>Ken:
>I tried that and my output file still shows that there are only 64
>unique hosts being used four times each instead of 256 hosts used 1
>time each. So as I said I am not getting the results out of the
>ppn=1 directive that I am expecting.
>--
>Steven DuChene
>
>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
>Sent: Wednesday, October 26, 2011 10:07 AM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] torque not listening to ppn request specs
>
>----- Original Message -----
>> From: "StevenX A DuChene"
>> To: torqueusers at supercluster.org
>> Sent: Tuesday, October 25, 2011 6:10:13 PM
>> Subject: [torqueusers] torque not listening to ppn request specs
>>
>> Hello all:
>>
>> I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of small
>> clusters and I am submitting the following mpi job using:
>>
>> qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs
>>
>> this script is very simple as it only has one line in it to invoke
>> the call to mpirun
>>
>> mpirun --machinefile $PBS_NODEFILE /home/myuser/mpi_test/mpi_hello_hostname
>>
>> The actual source to this is also very simple:
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int *buf, i, rank, nints, len;
>>     char hostname[256];
>>
>>     MPI_Init(&argc,&argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     gethostname(hostname,255);
>>     printf("Hello world! I am process number: %d on host %s\n", rank, hostname);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> When I run this with the ppn=1 specification I would expect one
>> processor per node spread over twelve nodes but when I look at my
>> output file I see it is running multiple processes per node instead.
>> So as a result I do not see the output from twelve unique nodes as I
>> would expect.
>>
>> My nodes file has the following sorts of entries:
>>
>> enode01 np=4 mynode
>> enode02 np=4 mynode
>> enode03 np=4 mynode
>> enode04 np=4 mynode
>> enode05 np=4 mynode
>> enode06 np=4 mynode
>> enode07 np=4 mynode
>> enode08 np=4 mynode
>> enode09 np=4 mynode
>> enode10 np=4 mynode
>> enode11 np=4 mynode
>> enode12 np=4 mynode
>>
>> I know I can remove the np=4 from each node specification and get the
>> one process per node but I was under the impression that I could use
>> the ppn=1 or whatever to get the same thing.
>>
>> Am I misunderstanding or overlooking something?
>>
>> --
>
>Steven,
>
>Try qsub -l nodes=12:ppn=1:mynode script_noarch.pbs
>
>Ken
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From lloyd_brown at byu.edu Thu Oct 27 11:08:00 2011
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Thu, 27 Oct 2011 11:08:00 -0600
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References:
Message-ID: <4EA98FF0.9050709@byu.edu>

Steve,

I'm not a Maui expert (we use Moab), but it sounds like this is an
optimization by the scheduler. In the end, Torque just does what the
scheduler tells it to, so if it's being told to consolidate down to 64
nodes, then it will happily do so.

Looking at the Maui docs, though, it does seem like the
JOBNODEMATCHPOLICY has been carried over from Moab.
What happens if you
put something like the following in your Maui config:

> JOBNODEMATCHPOLICY EXACTNODE

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 10/27/2011 10:53 AM, DuChene, StevenX A wrote:
> Thanks to all who are reading and responding to my pleas for assistance or guidance.
>
> We are a benchmarking center and I have a user who wants to start up his benchmark process across all 256 nodes, one process per node. Yes, right now I am using openmpi but later today I need to try all of this with the Intel MPI implementation.
>
> I tried doing the following:
>
> $(PBS_NODEFILE) > /home/myuser/mpi_test/cruddy256
> mpirun --machinefile $PBS_NODEFILE /home/myuser/mpi_test/mpi_hello_hostname
>
> so I could try examining the nodefile I am getting from torque but all I get is a zero length file.
>
> I looked in my torque accounting logs and I see things in the execution host list of:
>
> exec_host=eatom255/3+eatom255/2+eatom255/1+eatom255/0+eatom254/3+eatom254/2+eatom254/1+eatom254/0+eatom253/3+eatom253/2+eatom253/1+eatom253/0
>
> I copied this exec_host= stuff to a separate file and did some text munging and I only see 64 unique hosts being allocated by torque.
>
> So does that mean torque is screwing me over or could it still be some optimization being done by maui that is running as the scheduler above the torque pbs_server process?
> --
> Steven DuChene
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Coyle, James J [ITACD]
> Sent: Thursday, October 27, 2011 9:19 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] torque not listening to ppn request specs
>
> Steve,
>
> If this is a question just of design and not of use,
> ignore the following:
>
> Getting what you want, 1 processor on N nodes.
>
> Possibilities:
> 1) One possibility is to try:
>
> qmgr -c 'set server node_pack = False'
>
> (I think that the default setup is True, which is
> what I want and use, this keeps nodes more free.)
> I don't know if that will give you the behavior that
> you want, but it does try to launch jobs on separate
> nodes.
>
> 2) Use nodes=20:ppn=4 and use --bynode option if you are using
> OpenMPI (which is what I advise users here) or
> if you are using another implementation of MPI that does
> not support --bynode or something similar, issue
>
> uniq < ${PBS_NODEFILE} > Nodefile
> mpirun -np 20 -machinefile Nodefile ./application
>
> (I actually supply a script mpirun1, which does this along
> with mpirun2, mpirun3, that supply 2, 3, etc per node for
> two clusters that use vendor MPI's based upon MPICH.)
>
> best of Luck,
> James Coyle, PhD
> High Performance Computing Group
> Iowa State Univ.
> web: http://jjc.public.iastate.edu/

From stevenx.a.duchene at intel.com Thu Oct 27 11:18:46 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Thu, 27 Oct 2011 10:18:46 -0700
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To: <4EA98FF0.9050709@byu.edu>
References: <4EA98FF0.9050709@byu.edu>
Message-ID:

Cool! Thanks Lloyd! That seems to have done the trick. I got 256 unique nodes this time instead of 64.
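
A unique-host count like the 256 versus 64 reported here can be checked directly from an exec_host string or from the nodefile. What follows is only a minimal sketch and was not posted in the thread: the helper names count_exec_hosts and count_nodefile_hosts are made up for illustration, and the sample value is the exec_host line quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch only: these helper names are illustrative, not part of Torque or Maui.

# Count distinct hosts in an exec_host string of the form
# "host/slot+host/slot+...", as seen in the Torque accounting logs.
count_exec_hosts() {
    printf '%s\n' "$1" | tr '+' '\n' | sed 's,/.*$,,' | sort -u | wc -l | tr -d ' '
}

# Count distinct hosts in a nodefile, which lists one hostname per
# allocated slot (so a packed node appears several times).
count_nodefile_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Example with the exec_host value from the thread: three hosts, four slots each.
count_exec_hosts "eatom255/3+eatom255/2+eatom255/1+eatom255/0+eatom254/3+eatom254/2+eatom254/1+eatom254/0+eatom253/3+eatom253/2+eatom253/1+eatom253/0"
```

Inside a job script the second helper would be called as count_nodefile_hosts "$PBS_NODEFILE". To keep a copy of the nodefile for inspection, cat "$PBS_NODEFILE" > somefile is the usual form; in sh, $(PBS_NODEFILE) is command substitution and tries to run a command of that name, which would likely explain a zero-length output file.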
However does setting this policy in my maui.cfg file mean it will never pack processes if that is actually what a user intends?
--
Steven DuChene

From lloyd_brown at byu.edu Thu Oct 27 11:25:42 2011
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Thu, 27 Oct 2011 11:25:42 -0600
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References: <4EA98FF0.9050709@byu.edu>
Message-ID: <4EA99416.9070907@byu.edu>

Well, there are a couple of approaches, at least with Moab (again,
never actually used Maui; YMMV):

- Make packing the default (unset JOBNODEMATCHPOLICY), and append the
"-W x=nmatchpolicy:exactnode" to the job, either as a parameter to qsub,
or as a "#PBS -W x=...." line in your script
- Make exactnode the default, and have people who don't care about the
exact layout use the "procs=x" syntax, instead of the "nodes=x:ppn=y"
syntax.

Again, not sure if these work with Maui, but they're worth a try.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

From stevenx.a.duchene at intel.com Thu Oct 27 12:38:45 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Thu, 27 Oct 2011 11:38:45 -0700
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To: <4EA99416.9070907@byu.edu>
References: <4EA98FF0.9050709@byu.edu> <4EA99416.9070907@byu.edu>
Message-ID:

I am looking at the torque documentation for the qsub command. Specifically I am looking at what it says about the -W option. The documentation lists several additional attributes that torque supports but I do not see any mention of an x=blah in that list of supported additional attributes.

I have tried this method of taking the "JOBNODEMATCHPOLICY EXACTNODE" out of the maui.cfg file and then doing my qsub with the following:

qsub -l nodes=256:SeaMicro:ppn=1 -W x=nmatchpolicy:exactnode script_noarch.pbs

but I do not get the 256 unique nodes I asked for. I am back to 64 nodes with four processes packed per node.
--
Steven DuChene

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Lloyd Brown
Sent: Thursday, October 27, 2011 10:26 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] torque not listening to ppn request specs

Well, there are a couple of approaches, at least with Moab (again,
never actually used Maui; YMMV):

- Make packing the default (unset JOBNODEMATCHPOLICY), and append the
"-W x=nmatchpolicy:exactnode" to the job, either as a parameter to qsub,
or as a "#PBS -W x=...." line in your script
- Make exactnode the default, and have people who don't care about the
exact layout use the "procs=x" syntax, instead of the "nodes=x:ppn=y"
syntax.

Again, not sure if these work with Maui, but they're worth a try.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 10/27/2011 11:18 AM, DuChene, StevenX A wrote:
> Cool! Thanks Lloyd! That seems to have done the trick. I got 256 unique nodes this time instead of 64.
> > However does setting this policy in my maui.cfg file mean it will never pack processes if that is actually what a user intends? > -- > Steven DuChene > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Lloyd Brown > Sent: Thursday, October 27, 2011 10:08 AM > To: torqueusers at supercluster.org > Subject: Re: [torqueusers] torque not listening to ppn request specs > > Steve, > > I'm not a Maui expert (we use Moab), but it sounds like this is an > optimization by the scheduler. In the end, Torque just does what the > scheduler tells it to, so if it's being told to consolidate down to 64 > nodes, then it will happily do so. > > Looking at the Maui docs, though, it does seem like the > JOBNODEMATCHPOLICY has been carried over from Moab. What happens if you > put something like the following in your Maui config: > >> JOBNODEMATCHPOLICY EXACTNODE > > > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 10/27/2011 10:53 AM, DuChene, StevenX A wrote: >> Thanks to all how are reading and responding to my pleas for assistance or guidance. >> >> We are a benchmarking center and I have a user who wants to start up his benchmark process across all 256 nodes, one process per node. Yes, right now I am using openmpi but later today I need to try all of this with the Intel MPI implementation. >> >> I tried doing the following: >> >> $(PBS_NODEFILE) > /home/myuser/mpi_test/cruddy256 >> mpirun --machinefile $PBS_NODEFILE /home/myuser/mpi_test/mpi_hello_hostname >> >> so I could try examining the nodefile I am getting from torque but all I get is a zero length file. 
>> >> I looked in my torque accounting logs and I see things in the execution host list of: >> >> exec_host=eatom255/3+eatom255/2+eatom255/1+eatom255/0+eatom254/3+eatom254/2+eatom254/1+eatom254/0+eatom253/3+eatom253/2+eatom253/1+eatom253/0 >> >> I copied this exec_host= stuff to a separate file and did some text munging and I only see 64 unique hosts being allocated by torque. >> >> So does that mean torque is screwing me over or could it still be some optimization being done by maui that is running as the scheduler above the torque pbs_server process? >> -- >> Steven DuChene >> >> -----Original Message----- >> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Coyle, James J [ITACD] >> Sent: Thursday, October 27, 2011 9:19 AM >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] torque not listening to ppn request specs >> >> Steve, >> >> If this is a question just of design and not of use, >> ignore the following: >> >> >> >> Getting what you want, 1 processor on N nodes. >> >> Possibilities: >> 1) One possibility is to try: >> >> qmgr -c 'set server node_pack = False' >> >> (I think that the default setup is True, which is >> what I want and use, this keeps nodes more free.) >> I don't know if that will give you the behavior that >> you want, but it does try to launch jobs on separate >> nodes. >> >> 2) Use nodes=20:ppn=4 and use --bynode option if you are using >> OpenMPI (which is what I advise users here) or >> if you are using another implementation of MPI that does >> not support --bynode or something similar, issue >> >> uniq < ${PBS_NODEFILE} > Nodefile >> mpirun -np 20 -machinefile Nodefile ./application >> >> (I actually supply a script mpirun1, which does this along >> with mpirun2, mpirun3, that supply 2, 3, etc per node for >> two clusters that use vendor MPI's based upon MPICH.) >> >> best of Luck, >> James Coyle, PhD >> High Performance Computing Group >> Iowa State Univ. 
>> web: http://jjc.public.iastate.edu/ >> >>> -----Original Message----- >>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >>> bounces at supercluster.org] On Behalf Of DuChene, StevenX A >>> Sent: Thursday, October 27, 2011 10:48 AM >>> To: Torque Users Mailing List >>> Subject: Re: [torqueusers] torque not listening to ppn request specs >>> >>> Is it possible that there is some maui setting that could have an >>> effect on packing processes on nodes (one per processor) rather than >>> spreading them out across nodes (one per node)? Some "optimization" >>> thing I need to turn off or on? >>> -- >>> Steven DuChene >>> >>> -----Original Message----- >>> From: DuChene, StevenX A >>> Sent: Thursday, October 27, 2011 8:32 AM >>> To: Torque Users Mailing List >>> Subject: RE: [torqueusers] torque not listening to ppn request specs >>> >>> Ken: >>> I tried that and my output file still shows that there are only 64 >>> unique hosts being used four times each instead of 256 hosts used 1 >>> time each. So as I said I am not getting the results out of the >>> ppn=1 directive that I am expecting. 
>>> -- >>> Steven DuChene >>> >>> -----Original Message----- >>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >>> bounces at supercluster.org] On Behalf Of Ken Nielson >>> Sent: Wednesday, October 26, 2011 10:07 AM >>> To: Torque Users Mailing List >>> Subject: Re: [torqueusers] torque not listening to ppn request specs >>> >>> >>> >>> ----- Original Message ----- >>>> From: "StevenX A DuChene" >>>> To: torqueusers at supercluster.org >>>> Sent: Tuesday, October 25, 2011 6:10:13 PM >>>> Subject: [torqueusers] torque not listening to ppn request specs >>>> >>>> >>>> >>>> >>>> >>>> Hello all: >>>> >>>> I have torque 2.5.7 and maui 3.2.6p21 installed on a couple of >>> small >>>> clusters and I am submitting the following mpi job using: >>>> >>>> >>>> >>>> qsub -l nodes=12:mynode:ppn=1 script_noarch.pbs >>>> >>>> >>>> >>>> this script is very simple as it only has one line in it to invoke >>>> the call to mpirun >>>> >>>> >>>> >>>> mpirun --machinefile $PBS_NODEFILE >>>> /home/myuser/mpi_test/mpi_hello_hostname >>>> >>>> >>>> >>>> The actual source to this is also very simple: >>>> >>>> >>>> >>>> #include >>>> >>>> #include >>>> >>>> >>>> >>>> int main(int argc, char **argv) >>>> >>>> { >>>> >>>> int *buf, i, rank, nints, len; >>>> >>>> char hostname[256]; >>>> >>>> >>>> >>>> MPI_Init(&argc,&argv); >>>> >>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank); >>>> >>>> gethostname(hostname,255); >>>> >>>> printf("Hello world! I am process number: %d on host %s\n", rank, >>>> hostname); >>>> >>>> MPI_Finalize(); >>>> >>>> return 0; >>>> >>>> } >>>> >>>> >>>> >>>> When I run this with the ppn=1 specification I would expect one >>>> processer per node spread over twelve nodes but when I look at my >>>> output file I see it is running multiple processes per node >>> instead. >>>> So as a result I do not see the output from twelve unique nodes as >>> I >>>> would expect. 
>>>> >>>> >>>> >>>> My nodes file has the following sorts of entries: >>>> >>>> >>>> >>>> enode01 np=4 mynode >>>> >>>> enode02 np=4 mynode >>>> >>>> enode03 np=4 mynode >>>> >>>> enode04 np=4 mynode >>>> >>>> enode05 np=4 mynode >>>> >>>> enode06 np=4 mynode >>>> >>>> enode07 np=4 mynode >>>> >>>> enode08 np=4 mynode >>>> >>>> enode09 np=4 mynode >>>> >>>> enode10 np=4 mynode >>>> >>>> enode11 np=4 mynode >>>> >>>> enode12 np=4 mynode >>>> >>>> >>>> >>>> I know I can remove the np=4 from each node specification and get >>> the >>>> one process per node but I was under the impression that I could >>> use >>>> the ppn=1 or whatever to get the same thing. >>>> >>>> >>>> >>>> Am I misunderstanding or overlooking something? >>>> >>>> -- >>>> >>> >>> >>> Steven, >>> >>> Try qsub -l nodes=12:ppn=1:mynode script_noarch.pbs >>> >>> Ken >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org 
http://www.supercluster.org/mailman/listinfo/torqueusers

From avb at ssau.ru  Thu Oct 27 09:24:53 2011
From: avb at ssau.ru (Alexandr Baskakov)
Date: Thu, 27 Oct 2011 19:24:53 +0400
Subject: [torqueusers] Allocate job on node only if it requested by job node properties
Message-ID: <4EA977C5.9030000@ssau.ru>

Hi.

Can anybody help me, please?
I need to configure TORQUE 3.0.2/Maui 3.3 to allocate a particular job to a node
only if that node has been requested through the job's node properties.

For example, we have 3 nodes:
n1 np=8
n2 np=8
n3 np=8 bigmem

If I submit a job with "-l nodes=2:ppn=8", it must run on n1 and n2.
If nodes n1 and n2 are already executing some jobs and I submit a job with
"-l nodes=2:ppn=8", it must be queued and wait for n1 and n2.
If I submit a job with "-l nodes=2:ppn=8:bigmem", it must run on n3,n1 or n3,n2.

In other words, regular jobs that have not requested the node property "bigmem"
shouldn't run on node n3.

--
Alexandr Baskakov,
Samara State Aerospace University
e-mail: avb at ssau.ru

From pat.o'bryant at exxonmobil.com  Thu Oct 27 13:05:40 2011
From: pat.o'bryant at exxonmobil.com (O'Bryant, Pat)
Date: Thu, 27 Oct 2011 14:05:40 -0500
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References: <4EA98FF0.9050709@byu.edu> <4EA99416.9070907@byu.edu>
Message-ID:

Steven,
   How about this:

qsub -l nodes=256:SeaMicro,tpn=1 script_noarch.pbs

Note that "tpn" means tasks per node and that it is preceded by a comma. The "ppn" value can be hard to figure out.

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
Sent: Thursday, October 27, 2011 1:39 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] torque not listening to ppn request specs

I am looking at the torque documentation for the qsub command. Specifically I am looking at what it says about the -W option.
The documentation lists several additional attributes that torque supports but I do not see any mention of an x=blah in that list of supported additional attributes.

I have tried taking the "JOBNODEMATCHPOLICY EXACTNODE" out of the maui.cfg file and then doing my qsub with the following:

qsub -l nodes=256:SeaMicro:ppn=1 -W x=nmatchpolicy:exactnode script_noarch.pbs

but I do not get the 256 unique nodes I asked for. I am back to 64 nodes with four processes packed per node.
--
Steven DuChene

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Lloyd Brown
Sent: Thursday, October 27, 2011 10:26 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] torque not listening to ppn request specs

Well, there are a couple of approaches, at least with Moab (again, never actually used Maui; YMMV):

- Make packing the default (unset JOBNODEMATCHPOLICY), and append the "-W x=nmatchpolicy:exactnode" to the job, either as a parameter to qsub, or as a "#PBS -W x=...." line in your script
- Make exactnode the default, and have people who don't care about the exact layout use the "procs=x" syntax, instead of the "nodes=x:ppn=y" syntax.

Again, not sure if these work with Maui, but they're worth a try.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 10/27/2011 11:18 AM, DuChene, StevenX A wrote:
> Cool! Thanks Lloyd! That seems to have done the trick. I got 256 unique nodes this time instead of 64.
>
> However does setting this policy in my maui.cfg file mean it will never pack processes if that is actually what a user intends?
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From jjc at iastate.edu  Thu Oct 27 13:16:36 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 27 Oct 2011 14:16:36 -0500
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References:
Message-ID:

Steve,

Instead of:

$(PBS_NODEFILE) > /home/myuser/mpi_test/cruddy256

I think you want

/bin/rm -f /home/myuser/mpi_test/cruddy256
cat ${PBS_NODEFILE} > /home/myuser/mpi_test/cruddy256

If you use the attached Torque job script, which reserves full nodes so
that you can't get repeats, you should get what you want, and it should work
with any mpirun which supports -machinefile (all of them I think).

- Jim C.

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of DuChene, StevenX A
>Sent: Thursday, October 27, 2011 11:54 AM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] torque not listening to ppn request specs
>
>Thanks to all who are reading and responding to my pleas for
>assistance or guidance.
>
>We are a benchmarking center and I have a user who wants to start up
>his benchmark process across all 256 nodes, one process per node.
>Yes, right now I am using openmpi but later today I need to try all
>of this with the Intel MPI implementation.
>
>I tried doing the following:
>
>$(PBS_NODEFILE) > /home/myuser/mpi_test/cruddy256
>mpirun --machinefile $PBS_NODEFILE
>/home/myuser/mpi_test/mpi_hello_hostname
>
>so I could try examining the nodefile I am getting from torque but
>all I get is a zero length file.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: myjob
Type: application/octet-stream
Size: 826 bytes
Desc: myjob
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111027/0e59540a/attachment.obj

From gus at ldeo.columbia.edu  Thu Oct 27 13:30:23 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Thu, 27 Oct 2011 15:30:23 -0400
Subject: [torqueusers] torque not listening to ppn request specs
In-Reply-To:
References:
Message-ID: <9FEE1936-8DE7-4069-BC9D-3F980009E258@ldeo.columbia.edu>

Hi Steven

On Oct 27, 2011, at 12:53 PM, DuChene, StevenX A wrote:

> Thanks to all who are reading and responding to my pleas for assistance or guidance.
>
> We are a benchmarking center and I have a user who wants to start up his benchmark process across all 256 nodes, one process per node.

The OpenMPI mpirun/mpiexec command has the options '-bynode/-pernode/-npernode'
which should do this, as long as you request all 256 nodes from Torque
[with #PBS -l nodes=256:ppn=8 assuming you have 8 cores per node].
See 'man mpiexec' for more details.

> Yes, right now I am using openmpi but later today I need to try all of this with the Intel MPI implementation.
>
> I tried doing the following:
>
> $(PBS_NODEFILE) > /home/myuser/mpi_test/cruddy256
> mpirun --machinefile $PBS_NODEFILE /home/myuser/mpi_test/mpi_hello_hostname
>

Is it a typo or did you miss the 'cat' command, as in 'cat $PBS_NODEFILE > ...'?

BTW, if you build OpenMPI from source/tarball with Torque support
[configure --prefix=/your/favorite/location/to/install --with-tm=/path/to/libtorque]
then mpiexec will use $PBS_NODEFILE automatically as its machinefile,
no need to create it by hand.

I hope this helps,
Gus Correa

> so I could try examining the nodefile I am getting from torque but all I get is a zero length file.

From siegert at sfu.ca  Thu Oct 27 17:48:01 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Thu, 27 Oct 2011 16:48:01 -0700
Subject: [torqueusers] Using procs and grouping by even numbers of processors/node
In-Reply-To: <6148E9DE-B6F4-4CCA-A41B-92539888CBAA@quantumbioinc.com>
References: <6148E9DE-B6F4-4CCA-A41B-92539888CBAA@quantumbioinc.com>
Message-ID: <20111027234801.GG28947@stikine.sfu.ca>

Hi Lance,

On Thu, Oct 27, 2011 at 10:57:00AM -0400, Lance Westerhoff wrote:
>
> Hello All-
>
> Another question that has
come up in the use of the procs qsub argument. > When you run GAMESS, apparently you need to group the processors evenly > among nodes. In other words, if you ask for procs=4 and 3 processors are > allocated on one node and 1 allocated on the other, the job will fail > since GAMESS requires that an even number of processes be available on > the nodes on which it runs. In this case, the job would have been > successful if 2 processes were started on one node and 2 processes > were started on the other. Is there anyway to modify or augment the > procs option in order to make sure that an even number of processes are > started on each node? > > Thanks! > > -Lance > ____________________ > Lance M. Westerhoff, Ph.D. > General Manager > QuantumBio Inc. > > WWW: http://www.quantumbioinc.com > Email: lance at quantumbioinc.com > > Phone: 814-235-6908 > Fax: 814-235-6909 Actually, GAMESS prefers for best overall performance to run two processes on the same core since the data process is mostly idle (i.e., running the data process on a core by itself is a waste of resources). Thus, the following should work: #PBS -l procs=15 rm -f myhosts.$PBS_JOBID while read line; do echo $line >> myhosts.$PBS_JOBID echo $line >> myhosts.$PBS_JOBID done < $PBS_NODEFILE NP=`expr $PBS_NP \* 2` mpiexec -n $NP -hostfile myhosts.$PBS_JOBID gamess.x You probably want to use cpusets when doing things like this. Cheers, Martin -- Martin Siegert Head, Research Computing Simon Fraser University From vanw at sabalcore.com Fri Oct 28 06:15:32 2011 From: vanw at sabalcore.com (Kevin Van Workum) Date: Fri, 28 Oct 2011 08:15:32 -0400 Subject: [torqueusers] Allocate job on node only if it requested by job node properties In-Reply-To: <4EA977C5.9030000@ssau.ru> References: <4EA977C5.9030000@ssau.ru> Message-ID: On Thu, Oct 27, 2011 at 11:24 AM, Alexandr Baskakov wrote: > Hi. > > Can anybody help me, please. 
> I need to configure TORQUE 3.0.2/Maui 3.3 to allocate a particular job on a
> node only if that node has been requested through the job's node properties.
>
> For example, we have 3 nodes:
> n1 np=8
> n2 np=8
> n3 np=8 bigmem
>
> If I submit a job with "-l nodes=2:ppn=8", it must run on n1 and n2.
> If nodes n1 and n2 are already executing jobs and I submit a job with
> "-l nodes=2:ppn=8", it must be queued and wait for n1 and n2.
> If I submit a job with "-l nodes=2:ppn=8:bigmem", it must run on n3,n1 or
> n3,n2.
>
> In other words, regular jobs that have not requested the "bigmem" node
> property shouldn't run on node n3.
>

nodes=2:ppn=8 means "any 2 nodes with 8 cores". Since n3 fits this
description, it is considered eligible for the job. What you want is to
specify nodes that are not-bigmem. We don't use Maui, so I don't know if it
has any set operators for node properties other than ":" (AND), but a
complement operator is what you need.

Another option is to assign all other nodes a 'not-bigmem' property. Then,
optionally, you could set 'not-bigmem' as a default property.

BTW, in your example, nodes=2:ppn=8:bigmem will never run. You probably
mean to say nodes=1:ppn=8:bigmem+1:ppn=8.

--
Kevin Van Workum, PhD
Sabalcore Computing Inc.
Run your code on 500 processors.
Sign up for a free trial account.
www.sabalcore.com
877-492-8027 ext. 11

From avb at ssau.ru Fri Oct 28 06:51:32 2011
From: avb at ssau.ru (Alexandr Baskakov)
Date: Fri, 28 Oct 2011 16:51:32 +0400
Subject: [torqueusers] Allocate job on node only if it requested by job node properties
In-Reply-To:
References: <4EA977C5.9030000@ssau.ru>
Message-ID: <4EAAA554.9060901@ssau.ru>

Thanks for the answer.

> BTW, in your example, nodes=2:ppn=8:bigmem, will never run. You probably
> mean to say nodes=1:ppn=8:bigmem+1:ppn=8.

Yes, I mean "nodes=1:ppn=8:bigmem+1:ppn=8".

In Maui there is a standing reservation parameter, SRCFG, that can be used:

SRCFG[bigmem] PERIOD=INFINITY
SRCFG[bigmem] CLASSLIST=batch
SRCFG[bigmem] HOSTLIST=n3

But SRCFG has no ACL property that would allow node n3 to be used only if
the "bigmem" node property is requested through qsub.

--
Alexandr Baskakov,
Samara State Aerospace University
e-mail: avb at ssau.ru

From knielson at adaptivecomputing.com Fri Oct 28 11:23:16 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 28 Oct 2011 11:23:16 -0600 (MDT)
Subject: [torqueusers] PAM documentation
In-Reply-To:
Message-ID: <72acdfbd-d49e-44e5-b5f4-160c44e863a3@mail>

Hi all,

Does anyone use PAM with torque as described in the current documentation?

http://www.adaptivecomputing.com/resources/docs/torque/3.4hostsecurity.php

From what I have gathered from a few users and from what the code currently
does, this description is outdated. To build with PAM enabled you only need
to use --with-pam on the configure line. There is no need to pull in the
code from the contrib directory.

Regards

Ken

From stevenx.a.duchene at intel.com Fri Oct 28 11:26:40 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Fri, 28 Oct 2011 10:26:40 -0700
Subject: [torqueusers] PAM documentation
In-Reply-To: <72acdfbd-d49e-44e5-b5f4-160c44e863a3@mail>
References: <72acdfbd-d49e-44e5-b5f4-160c44e863a3@mail>
Message-ID:

Is this the code that allows the prologue and epilogue to lock users out of
logging into compute nodes until they actually have a job active on a
particular node?
If so then yes, I have used this feature at almost every location where I
have set up torque.
--
Steven DuChene

From siegert at sfu.ca Fri Oct 28 11:39:56 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Fri, 28 Oct 2011 10:39:56 -0700
Subject: [torqueusers] using non-privileged ports
Message-ID: <20111028173956.GB5953@stikine.sfu.ca>

Hi,

we just recompiled torque with

--disable-privports

(since we constantly ran out of ports). Now we have a different
problem which is just as bad:

# qstat -an1
Connection timed out
qstat: cannot connect to server b0 (errno=110) Connection timed out

This does not appear right away after starting the server, but after
a few hours of running.
As far as I can tell the only way to get the
server out of this state is to restart it.

But there must be many sites that run torque with --disable-privports.
Thus: what am I missing?

Cheers,
Martin

--
Martin Siegert
Head, Research Computing
Simon Fraser University

From siegert at sfu.ca Fri Oct 28 14:57:15 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Fri, 28 Oct 2011 13:57:15 -0700
Subject: [torqueusers] using non-privileged ports
In-Reply-To: <20111028173956.GB5953@stikine.sfu.ca>
References: <20111028173956.GB5953@stikine.sfu.ca>
Message-ID: <20111028205715.GD6732@stikine.sfu.ca>

Hi,

We gave up: --disable-privports does not appear to be working. Now we
are back to our previous problem (this is on the server - there are
no connections in the TIME_WAIT state on the nodes):

# netstat -na | grep 15002
tcp        0      0 172.18.1.0:629      172.18.1.152:15002      TIME_WAIT
tcp        0      0 172.18.1.0:701      172.18.1.152:15002      TIME_WAIT
tcp        0      0 172.18.1.0:689      172.18.1.152:15002      TIME_WAIT
tcp        0      0 172.18.1.0:685      172.18.1.152:15002      TIME_WAIT
tcp        0      0 172.18.1.0:951      172.18.1.152:15002      TIME_WAIT
tcp        0      0 172.18.1.0:979      172.18.1.152:15002      TIME_WAIT
tcp        0      0 172.18.1.0:962      172.18.1.152:15002      TIME_WAIT
tcp        0      0 172.18.1.0:669      172.18.1.154:15002      TIME_WAIT
tcp        0      0 172.18.1.0:662      172.18.1.154:15002      TIME_WAIT
tcp        0      0 172.18.1.0:804      172.18.1.154:15002      TIME_WAIT
...
# netstat -na | grep 15002 | wc -l
974

For some reason the mom-server connections are not closed correctly and we
end up with all these sockets in the TIME_WAIT state. Note that there are
even several for the same node. Consequently we run out of ports.

Is this a torque problem?

Now we work around the problem by setting

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

Cheers,
Martin

From knielson at adaptivecomputing.com Fri Oct 28 15:13:26 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 28 Oct 2011 15:13:26 -0600 (MDT)
Subject: [torqueusers] using non-privileged ports
In-Reply-To: <20111028205715.GD6732@stikine.sfu.ca>
Message-ID:

Martin,

Thanks for the information. I will see what is happening with this.

Ken

From knielson at adaptivecomputing.com Fri Oct 28 17:01:51 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 28 Oct 2011 17:01:51 -0600 (MDT)
Subject: [torqueusers] 2.5.9 release candidate
In-Reply-To: <2ea70129-1591-47a8-ad71-5f6265425272@mail>
Message-ID: <9f1ec59d-04b5-4ee0-9f6b-4118f3148f82@mail>

There is a 2.5.9 release candidate available at
http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.9-snap.201110281656.tar.gz

The following is from the Release_Notes file from the source:

There were several bug fixes to TORQUE 2.5.9. Following is a list of
notable bug fixes.

Added function DIS_tcp_close which frees buffer memory used for sending
and receiving tcp data. This reduces the running memory size of TORQUE.

Fix for a server seg-fault when using the record_job_info.

Fix for afteranyarray and afterokarray where dependent jobs would not run
after the dependent array requirements were satisfied.

Fix to delete .AR array files from the $TORQUE_HOME/server_priv/arrays
directory.

Fix to recover previous state of job arrays between restarts of pbs_server.

Fix to prevent the server from hanging when moving jobs from one server to
another server.

Fix to stop a segfault if using munge and the munge daemon was not running.

Security fix to munge authorization to prevent users from gaining access to
TORQUE when munge was not running.

Fix to allow pam_pbssimpleauth to work properly.

A new torque.cfg option was added named TRQ_IFNAME. This option allows the
administrator to select the outbound tcp interface by interface name for
qsub commands.

To see a complete list of changes please see the CHANGELOG.

Please download and try this snapshot and let us know if there are
additional issues that we need to fix before we release 2.5.9.

Regards

Ken Nielson
Adaptive Computing

From mej at lbl.gov Fri Oct 28 18:15:56 2011
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 28 Oct 2011 17:15:56 -0700
Subject: [torqueusers] using non-privileged ports
In-Reply-To: <20111028205715.GD6732@stikine.sfu.ca>
References: <20111028173956.GB5953@stikine.sfu.ca> <20111028205715.GD6732@stikine.sfu.ca>
Message-ID: <20111029001555.GL12776@lbl.gov>

On Friday, 28 October 2011, at 13:57:15 (-0700),
Martin Siegert wrote:

> # netstat -na | grep 15002
> tcp        0      0 172.18.1.0:629      172.18.1.152:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:701      172.18.1.152:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:689      172.18.1.152:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:685      172.18.1.152:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:951      172.18.1.152:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:979      172.18.1.152:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:962      172.18.1.152:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:669      172.18.1.154:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:662      172.18.1.154:15002      TIME_WAIT
> tcp        0      0 172.18.1.0:804      172.18.1.154:15002      TIME_WAIT
> ...
> # netstat -na | grep 15002 | wc -l
> 974
>
> For some reason the mom-server connections are not closed correctly and we
> end up with all these sockets in the TIME_WAIT state. Note that there are
> even several ones for the same node. Consequently we run out of ports.
>
> Is this a torque problem?
>
> Now we work around the problem by setting
>
> net.ipv4.tcp_tw_recycle = 1
> net.ipv4.tcp_tw_reuse = 1

This points strongly to the following being missing somewhere:

  int i = 1;
  setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (char *)&i, sizeof(i));

Possibly in the code that opens the sockets to connect to the moms?

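To make the suggestion above concrete, here is a minimal, self-contained
sketch of setting SO_REUSEADDR on a socket before binding a local port. It
is not taken from the TORQUE source: open_reusable_socket() and
reuseaddr_enabled() are hypothetical helper names, and whether a missing
SO_REUSEADDR is really what leaves the server short of ports is an
assumption that would have to be checked against the code that connects to
the moms.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not TORQUE code): create a TCP socket, enable
 * SO_REUSEADDR, and bind it to local_port on any local address.
 * Returns the socket descriptor, or -1 on error. */
int open_reusable_socket(unsigned short local_port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    /* The option must be set before bind() to affect the bind check. */
    int on = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        close(sock);
        return -1;
    }

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(local_port);

    if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}

/* Report whether SO_REUSEADDR is enabled on sock:
 * 1 if enabled, 0 if not, -1 if getsockopt() fails. */
int reuseaddr_enabled(int sock)
{
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

Calling open_reusable_socket(0) lets the kernel pick an unprivileged port;
binding a port below 1024, like the local ports in the netstat output
above, normally requires root. Note that SO_REUSEADDR only relaxes the
bind() check against sockets lingering in TIME_WAIT on the local port; it
does not shorten the TIME_WAIT period itself.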
Michael

--
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E        W: 510-495-2687
MS 050C-3396          F: 510-486-8615

From blake.wickliffe at aramco.com Fri Oct 28 22:44:20 2011
From: blake.wickliffe at aramco.com (Wickliffe, Blake W)
Date: Sat, 29 Oct 2011 04:44:20 +0000
Subject: [torqueusers] 2.5.9 release candidate
In-Reply-To: <9f1ec59d-04b5-4ee0-9f6b-4118f3148f82@mail>
References: <2ea70129-1591-47a8-ad71-5f6265425272@mail> <9f1ec59d-04b5-4ee0-9f6b-4118f3148f82@mail>
Message-ID: <09D3C16B37878C44837F749DB16ACF190BA89D@DHA00730-MSXP03.aramco.com>

No fix for pbs_server hang when a node goes down? :(

Blake Wickliffe
Saudi Aramco
ENOD/CSYS/USG
HPC Team (873-4417)

From samuel at unimelb.edu.au Sun Oct 30 22:20:27 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 31 Oct 2011 15:20:27 +1100
Subject: [torqueusers] Stale connection error in log file of torque server
In-Reply-To: <88B17E26E0A9F94381C67535AEF2BB5E2A685B28@EXCHNG13.physics.ox.ac.uk>
References: <88B17E26E0A9F94381C67535AEF2BB5E2A685B28@EXCHNG13.physics.ox.ac.uk>
Message-ID: <4EAE220B.4030009@unimelb.edu.au>

On 27/10/11 22:20, Kashif Mohammad wrote:

> Any suggestion please ?

Upgrade to a newer version of Torque, 2.3 is ancient.

Personally I'd suggest just going to 2.4.16 as it's
working here quite happily (though not with glite).

I suspect the Maui version is also pretty old, I think
they're up to 3.3.x now. That I can't comment on as we
use Moab here with Torque.

cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/