[torqueusers] procs= not working as documented
Steve Crusan
scrusan at ur.rochester.edu
Fri Nov 18 12:52:04 MST 2011
Lance,
I am not sure about this one, and I think this might be a bug in Maui. Just to be clear: does this only happen when there are not enough resources available?
Regardless of your fairshare weights or node access policies (which look fine, btw), this job should NOT start unless the resources are available.
What happens if you pause (stop) the scheduler and run the job manually with qrun? It would be interesting to see how TORQUE interprets the job compared with Maui. In the checkjob output below, Maui seems to think the job required only 26 tasks (Req[0] TaskCount: 26) and scheduled them accordingly, which would be correct if 26 procs were the true request. Also check your $PBS_NODEFILE to see what TORQUE is writing to the job's environment.
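For that comparison, here is a rough bash sketch of two checks. It is not taken from any TORQUE or Maui tooling, and the function names count_nodefile_slots and sum_allocated_tasks are made up for illustration. It assumes the node file TORQUE hands to the job (normally $PBS_NODEFILE) has one line per allocated processor slot, and that checkjob prints allocations as [host:N] entries, as in the output quoted below.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative, not part of TORQUE or Maui.

# Count processor slots in a TORQUE node file.
# Assumption: one non-empty line per allocated slot.
count_nodefile_slots() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Read checkjob "Allocated Nodes" text on stdin and add up the N
# in every [hostname:N] entry.
sum_allocated_tasks() {
    grep -o ':[0-9][0-9]*]' | tr -d ':]' | awk '{ total += $1 } END { print total + 0 }'
}

# The allocation from job 371010 quoted below adds up to 26, not the 100 requested.
printf '%s\n' \
    '[compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]' \
    '[compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]' \
    '[compute-0-13:2][compute-0-14:2]' | sum_allocated_tasks   # prints 26
```

Inside the job script itself, something like count_nodefile_slots "$PBS_NODEFILE" would then show how many slots TORQUE actually granted, which you can compare against the procs= value you asked for.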
I know you can dive deep into the Maui logs by turning up the debug level, but let's start somewhere closer...
PS: I cannot reproduce this, but I'm running Moab 6.0.4 + TORQUE 2.5.6.
~Steve
On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote:
>
> Hi Steve-
>
> Here you go. Here are the top few lines of the job script. I have then provided the output you requested along with the maui.cfg. If you need anything further, certainly please let me know.
>
> Thanks for your help!
>
> ===============
>
> + head job.pbs
>
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -l procs=100
> #PBS -l pmem=700mb
> #PBS -l walltime=744:00:00
> #PBS -j oe
> #PBS -q batch
>
> Report run on Fri Nov 18 10:49:38 EST 2011
> + pbsnodes --version
> version: 3.0.2
> + diagnose --version
> maui client version 3.2.6p21
> + checkjob 371010
>
>
> checking job 371010
>
> State: Running
> Creds: user:josh group:games class:batch qos:DEFAULT
> WallTime: 00:02:35 of 31:00:00:00
> SubmitTime: Fri Nov 18 10:46:33
> (Time Queued Total: 00:00:01 Eligible: 00:00:01)
>
> StartTime: Fri Nov 18 10:46:34
> Total Tasks: 1
>
> Req[0] TaskCount: 26 Partition: DEFAULT
> Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1 MEM: 700M
> NodeCount: 10
> Allocated Nodes:
> [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
> [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
> [compute-0-13:2][compute-0-14:2]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: 31:00:00:00)
> PE: 26.00 StartPriority: 4716
>
> + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
> SERVERHOST gondor
> ADMIN1 maui root
> ADMIN3 ALL
> RMCFG[base] TYPE=PBS
> AMCFG[bank] TYPE=NONE
> RMPOLLINTERVAL 00:01:00
> SERVERPORT 42559
> SERVERMODE NORMAL
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGLEVEL 3
> QUEUETIMEWEIGHT 1
> FSPOLICY DEDICATEDPS
> FSDEPTH 7
> FSINTERVAL 86400
> FSDECAY 0.50
> FSWEIGHT 200
> FSUSERWEIGHT 1
> FSGROUPWEIGHT 1000
> FSQOSWEIGHT 1000
> FSACCOUNTWEIGHT 1
> FSCLASSWEIGHT 1000
> USERWEIGHT 4
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
> NODEALLOCATIONPOLICY MINRESOURCE
> RESERVATIONDEPTH 8
> MAXJOBPERUSERPOLICY OFF
> MAXJOBPERUSERCOUNT 8
> MAXPROCPERUSERPOLICY OFF
> MAXPROCPERUSERCOUNT 256
> MAXPROCSECONDPERUSERPOLICY OFF
> MAXPROCSECONDPERUSERCOUNT 36864000
> MAXJOBQUEUEDPERUSERPOLICY OFF
> MAXJOBQUEUEDPERUSERCOUNT 2
> JOBNODEMATCHPOLICY EXACTNODE
> NODEACCESSPOLICY SHARED
> JOBMAXOVERRUN 99:00:00:00
> DEFERCOUNT 8192
> DEFERTIME 0
> CLASSCFG[developer] FSTARGET=40.00+
> CLASSCFG[lowprio] PRIORITY=-1000
> SRCFG[developer] CLASSLIST=developer
> SRCFG[developer] ACCESS=dedicated
> SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
> SRCFG[developer] STARTTIME=08:00:00
> SRCFG[developer] ENDTIME=18:00:00
> SRCFG[developer] TIMELIMIT=2:00:00
> SRCFG[developer] RESOURCES=PROCS(8)
> USERCFG[DEFAULT] FSTARGET=100.0
>
> ===============
>
> -Lance
>
>
> On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote:
>
>> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
>>
>>> The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway).
>>
>> Hi Lance,
>>
>> Can you post the output of checkjob <jobid> for an incorrectly running job? Let's take a look at what Maui thinks the job is asking for.
>>
>> Might as well add your maui.cfg file also.
>>
>> I've found in the past that procs= is troublesome...
>>
>>>
>>> I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back.
>>>
>>> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system).
>>
>> Agreed. HPC cluster job management should normally be set-it-and-forget-it. Anything other than maintenance, break fixes, or new features would be ridiculously time consuming.
>>
>>>
>>> -Lance
>>>
>>>
>>>>
>>>> Message: 3
>>>> Date: Thu, 17 Nov 2011 17:29:17 -0800
>>>> From: "Brock Palen" <brockp at umich.edu>
>>>> Subject: Re: [torqueusers] procs= not working as documented
>>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
>>>> Content-Type: text/plain; charset="utf-8"
>>>>
>>>> Does maui only see one cpu or does mpiexec only see one cpu?
>>>>
>>>>
>>>>
>>>> Brock Palen
>>>> (734)936-1985
>>>> brockp at umich.edu
>>>> - Sent from my Palm Pre, please excuse typos
>>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote:
>>>>
>>>> Hello All-
>>>>
>>>> It appears that when running with the following specs, the procs= option does not actually work as expected.
>>>>
>>>> ==========================================
>>>>
>>>> #PBS -S /bin/bash
>>>> #PBS -l procs=60
>>>> #PBS -l pmem=700mb
>>>> #PBS -l walltime=744:00:00
>>>> #PBS -j oe
>>>> #PBS -q batch
>>>>
>>>> torque version: tried 3.0.2; in v2.5.4, I think the procs option worked as documented
>>>> maui version: 3.2.6p21 (also tried maui 3.3.1, but it fails completely with the procs option and only asks for a single CPU)
>>>>
>>>> ==========================================
>>>>
>>>> If there are fewer than 60 processors available in the cluster (in this case there were 53 available), the job will go in and take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60; however, if only 2 processors are available and the user asks for 60, I would hate for him to go in.
>>>>
>>>> Thank you for your time!
>>>>
>>>> -Lance
>>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>> ----------------------
>> Steve Crusan
>> System Administrator
>> Center for Research Computing
>> University of Rochester
>> https://www.crc.rochester.edu/
>>
----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/