[torqueusers] procs= not working as documented (or understood?)

Gareth.Williams at csiro.au
Tue Feb 14 16:27:25 MST 2012


> -----Original Message-----
> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
> Sent: Wednesday, 15 February 2012 5:12 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] procs= not working as documented (or
> understood?)
> 
> 
> Hello All-
> 
> We're still having trouble with this feature, and we are starting to
> shop around for a torque/maui replacement in order to get this
> capability. Before we do that, however, I wanted to see if anyone has
> any thoughts on how to address the problem within torque/maui. Perhaps
> I simply don't understand the feature. The versions of torque and maui
> we are using are:
> 
> 	torque-3.0.2
> 	maui-3.2.6p21
> 
> Yes, we have tried newer versions of maui, but then the option doesn't
> work at all.
> 
> Here is the scenario (I also included the conversation from November
> below for more information).
> 
> Conceptually, our software is almost infinitely scalable in the sense
> that there is very little overhead associated with interprocess
> communication. Therefore, we do not require that all of the processes
> reside on a small number of nodes. In fact, we can spread the
> processes across any and all nodes in the cluster with ~zero loss in
> performance. So we can literally have one node running a single
> process and another node running 8 processes. Since we have that level
> of scalability, we don't want to lock ourselves into requesting
> resources in the "nodes=X:ppn=Y" style, since that style requires
> nodes to open up or drain before they can be used. Since our users run
> a big mixture of single- and multi-processor jobs, waiting for node
> drain can really waste a lot of resources.
> 
> I saw the "procs=#" option in the Requesting Resources table (see
> http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml#resources
> for more). It *appears* that this option should allow the user to
> simply request X*Y processors and let the scheduler place them any way
> it can fit. So using the following #PBS directive, we should be able
> to request 40 processors:
> 
> #PBS -l procs=40
> 
> Instead, we see that the scheduler seems to take this information,
> read it, and basically disregard it. The reason I know it reads it is
> that if I ask for, say, 40 processors and 40 processors are available
> in the cluster, it works as expected and all is right with the world.
> Where it gets choppy is when I ask for 40 processors and only 1
> processor is available. The job doesn't wait in the queue for the
> remaining 39 processors to open up; instead PBS simply starts the job
> on that one processor. I can't see how that is anything but a bug. If
> the user is asking for 40 processors, why isn't the scheduler waiting
> for all 40 processors to open up?
> 
> I'll also post this to the maui list, so I apologize if you receive it
> twice. I'm just not sure whether this is a problem with torque, maui,
> or both. If answering this question requires additional information,
> please ask. We are at our wits' end here.
> 
> Thanks!
> 
> -Lance

Hi Lance,

It is more-or-less equivalent to request -l nodes=40 and -l procs=40 _if_
you are using maui _and_ you don't set JOBNODEMATCHPOLICY to EXACTNODE.
(Note that the maui.cfg you posted below does set EXACTNODE, which may
well explain the behaviour you're seeing.)
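
For example (a sketch only; we haven't verified this on your exact
versions), with the policy line commented out of maui.cfg:

	# JOBNODEMATCHPOLICY          EXACTNODE

these two requests should then be scheduled the same way, packing the
40 tasks onto however many nodes they fit:

	#PBS -l procs=40
	#PBS -l nodes=40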

You may need to 'fake' having a large number of nodes to make this work.
  
There are old mailing list items describing such a setup and how to 'fake'
the number of nodes.  We've never actually done this at our site, so I'm
unsure of the details. I don't like it :-) but it may suit/help you.
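
The variant I recall seeing cited most often (untested here; the value
is purely illustrative) raises the server's advertised node count via
qmgr, so that pbs_server accepts nodes=N requests larger than the
physical node count:

	# let pbs_server accept large nodes=N requests
	qmgr -c "set server resources_available.nodect = 9999"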

Gareth


> 
> 
> 
> 
> On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote:
> 
> >
> > Hi Steve-
> >
> > Here you go. Here are the top few lines of the job script. I have
> > also provided the output you requested along with the maui.cfg. If
> > you need anything further, please let me know.
> >
> > Thanks for your help!
> >
> > ===============
> >
> > + head job.pbs
> >
> > #!/bin/bash
> > #PBS -S /bin/bash
> > #PBS -l procs=100
> > #PBS -l pmem=700mb
> > #PBS -l walltime=744:00:00
> > #PBS -j oe
> > #PBS -q batch
> >
> > Report run on Fri Nov 18 10:49:38 EST 2011
> > + pbsnodes --version
> > version: 3.0.2
> > + diagnose --version
> > maui client version 3.2.6p21
> > + checkjob 371010
> >
> >
> > checking job 371010
> >
> > State: Running
> > Creds:  user:josh  group:games  class:batch  qos:DEFAULT
> > WallTime: 00:02:35 of 31:00:00:00
> > SubmitTime: Fri Nov 18 10:46:33
> >  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
> >
> > StartTime: Fri Nov 18 10:46:34
> > Total Tasks: 1
> >
> > Req[0]  TaskCount: 26  Partition: DEFAULT
> > Network: [NONE]  Memory >= 700M  Disk >= 0  Swap >= 0
> > Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> > Dedicated Resources Per Task: PROCS: 1  MEM: 700M
> > NodeCount: 10
> > Allocated Nodes:
> > [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
> > [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
> > [compute-0-13:2][compute-0-14:2]
> >
> >
> > IWD: [NONE]  Executable:  [NONE]
> > Bypass: 0  StartCount: 1
> > PartitionMask: [ALL]
> > Flags:       RESTARTABLE
> >
> > Reservation '371010' (-00:02:09 -> 30:23:57:51  Duration: 31:00:00:00)
> > PE:  26.00  StartPriority:  4716
> >
> > + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
> > SERVERHOST            gondor
> > ADMIN1                maui root
> > ADMIN3                ALL
> > RMCFG[base]  TYPE=PBS
> > AMCFG[bank]  TYPE=NONE
> > RMPOLLINTERVAL        00:01:00
> > SERVERPORT            42559
> > SERVERMODE            NORMAL
> > LOGFILE               maui.log
> > LOGFILEMAXSIZE        10000000
> > LOGLEVEL              3
> > QUEUETIMEWEIGHT       1
> > FSPOLICY              DEDICATEDPS
> > FSDEPTH               7
> > FSINTERVAL            86400
> > FSDECAY               0.50
> > FSWEIGHT              200
> > FSUSERWEIGHT          1
> > FSGROUPWEIGHT         1000
> > FSQOSWEIGHT           1000
> > FSACCOUNTWEIGHT       1
> > FSCLASSWEIGHT         1000
> > USERWEIGHT            4
> > BACKFILLPOLICY        FIRSTFIT
> > RESERVATIONPOLICY     CURRENTHIGHEST
> > NODEALLOCATIONPOLICY  MINRESOURCE
> > RESERVATIONDEPTH            8
> > MAXJOBPERUSERPOLICY         OFF
> > MAXJOBPERUSERCOUNT          8
> > MAXPROCPERUSERPOLICY        OFF
> > MAXPROCPERUSERCOUNT         256
> > MAXPROCSECONDPERUSERPOLICY  OFF
> > MAXPROCSECONDPERUSERCOUNT   36864000
> > MAXJOBQUEUEDPERUSERPOLICY   OFF
> > MAXJOBQUEUEDPERUSERCOUNT    2
> > JOBNODEMATCHPOLICY          EXACTNODE
> > NODEACCESSPOLICY            SHARED
> > JOBMAXOVERRUN 99:00:00:00
> > DEFERCOUNT 8192
> > DEFERTIME  0
> > CLASSCFG[developer] FSTARGET=40.00+
> > CLASSCFG[lowprio] PRIORITY=-1000
> > SRCFG[developer] CLASSLIST=developer
> > SRCFG[developer] ACCESS=dedicated
> > SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
> > SRCFG[developer] STARTTIME=08:00:00
> > SRCFG[developer] ENDTIME=18:00:00
> > SRCFG[developer] TIMELIMIT=2:00:00
> > SRCFG[developer] RESOURCES=PROCS(8)
> > USERCFG[DEFAULT]      FSTARGET=100.0
> >
> > ===============
> >
> > -Lance
> >
> >
> > On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote:
> >
> >>
> >> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
> >>
> >>> The request that is placed is for procs=60. Both torque and maui
> >>> see that there are only 53 processors available, and instead of
> >>> letting the job sit in the queue to wait for all 60 processors to
> >>> become available, it goes ahead and runs the job with what's there.
> >>> Now if the user could ask for procs=[50-60], where 50 is the
> >>> minimum number of processors to provide and 60 is the maximum,
> >>> this would be a feature. But as it stands, if the user asks for 60
> >>> processors and ends up with 2, the job just won't scale properly
> >>> and he may as well kill it (when it shouldn't have run anyway).
> >>
> >> Hi Lance,
> >>
> >> 	Can you post the output of checkjob <jobid> for an incorrectly
> >> running job? Let's take a look at what Maui thinks the job is
> >> asking for.
> >>
> >> 	Might as well add your maui.cfg file also.
> >>
> >> 	I've found in the past that procs= is troublesome...
> >>
> >>>
> >>> I'm actually beginning to think the problem may be related to
> >>> maui. Perhaps I'll post this same question to the maui list and
> >>> see what comes back.
> >>>
> >>> This problem is infuriating, though, since without the
> >>> functionality working as it should, procs=X makes torque/maui
> >>> behave more like a submit-and-run system than a queuing system.
> >>
> >> Agreed. HPC cluster job management is normally "set it and forget
> >> it". Anything other than maintenance/break fixes/new features would
> >> be ridiculously time consuming.
> >>
> >>>
> >>> -Lance
> >>>
> >>>
> >>>>
> >>>> Message: 3
> >>>> Date: Thu, 17 Nov 2011 17:29:17 -0800
> >>>> From: "Brock Palen" <brockp at umich.edu>
> >>>> Subject: Re: [torqueusers] procs= not working as documented
> >>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >>>> Message-ID:
> >>>> 	<20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> >>>> Content-Type: text/plain; charset="utf-8"
> >>>>
> >>>> Does maui only see one cpu or does mpiexec only see one cpu?
> >>>>
> >>>>
> >>>>
> >>>> Brock Palen
> >>>> (734)936-1985
> >>>> brockp at umich.edu
> >>>> - Sent from my Palm Pre, please excuse typos
> >>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff
> >>>> <lance at quantumbioinc.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> Hello All-
> >>>>
> >>>>
> >>>>
> >>>> It appears that when running with the following specs, the procs=
> >>>> option does not actually work as expected.
> >>>>
> >>>>
> >>>>
> >>>> ==========================================
> >>>>
> >>>>
> >>>>
> >>>> #PBS -S /bin/bash
> >>>>
> >>>> #PBS -l procs=60
> >>>>
> >>>> #PBS -l pmem=700mb
> >>>>
> >>>> #PBS -l walltime=744:00:00
> >>>>
> >>>> #PBS -j oe
> >>>>
> >>>> #PBS -q batch
> >>>>
> >>>>
> >>>>
> >>>> torque version: tried 3.0.2. In v2.5.4, I think the procs option
> >>>> worked as documented.
> >>>>
> >>>> maui version: 3.2.6p21 (also tried maui 3.3.1, but there the procs
> >>>> option fails completely and the job only asks for a single CPU)
> >>>>
> >>>>
> >>>>
> >>>> ==========================================
> >>>>
> >>>>
> >>>>
> >>>> If there are fewer than 60 processors available in the cluster (in
> >>>> this case there were 53 available), the job will go in and take
> >>>> whatever is left instead of waiting for all 60 processors to free
> >>>> up. Any thoughts as to why this might be happening? Sometimes it
> >>>> doesn't really matter, and 53 would be almost as good as 60;
> >>>> however, if only 2 processors are available and the user asks for
> >>>> 60, I would hate for his job to go in.
> >>>>
> >>>>
> >>>>
> >>>> Thank you for your time!
> >>>>
> >>>>
> >>>>
> >>>> -Lance
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >> ----------------------
> >> Steve Crusan
> >> System Administrator
> >> Center for Research Computing
> >> University of Rochester
> >> https://www.crc.rochester.edu/
> >>
> >>
> >
> 


