[Mauiusers] Strange queue/scheduler issue

Steve Crusan scrusan at ur.rochester.edu
Mon Nov 7 16:02:04 MST 2011



On Nov 7, 2011, at 5:56 PM, Roy Dragseth wrote:

> Have you tried setting a limit on the number of Idle jobs allowed per
> user?
> 
> For instance
> 
> USERCFG[DEFAULT] MAXIJOB=10
> 


Roy's suggestion above is correct. You can also cap the number of jobs a user may have queued on the TORQUE side; I believe the relevant queue attribute is max_user_queuable. A rough example via qmgr is below.
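
Something along these lines should do it (the queue name "batch" and the
cap of 500 are placeholders; adjust for your site):

  qmgr -c "set queue batch max_user_queuable = 500"

Anything beyond the cap is rejected at qsub time rather than piling up in
the queue.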

I've run into a situation where a user submitted more jobs than the scheduler could handle (the jobs wouldn't even be assigned a state), and I had to set a per-user job limit in Maui to recover. A sketch of what I mean is below.
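
If memory serves, the maui.cfg entry was along these lines (MAXJOB caps a
user's running jobs, MAXIJOB the idle jobs Maui will actually consider for
scheduling; the numbers here are only examples):

  USERCFG[DEFAULT] MAXJOB=512 MAXIJOB=10

Maui reads maui.cfg at startup, so restart it after making the change.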

Try Roy's fix above first, though. The TORQUE limit works well as a hard, easy-to-see per-user cap, i.e. users get immediate feedback when they break the rules.

> r.
> 
> 
> On Monday 7. November 2011 23.44.47 Ian Miller wrote:
>> Not sure if this is the correct forum for this, but we have a 320-core
>> grid running Maui & TORQUE. Three queues are set up: one of them has two
>> nodes (24 cores), another has two nodes exclusively, and three nodes
>> share the default queue.
>> When someone submits, say, 4000 jobs to the default queue, no one can
>> submit any jobs to either of the other queues; they just sit in Q status.
>> This started about three days ago and the users are totally in an uproar
>> about it.
>> 
>> Any thoughts on where to find the bottleneck, or on which config setting
>> is responsible, would be helpful.
>> 
>> -I
>> 
>> 
>> 
>> Ian Miller
>> System Administrator
>> ianm at uchicago.edu
>> 312-282-6507
>> 
>> 
>> 
>> 
>> 
>> 
>> On 10/26/11 6:07 PM, "Gareth.Williams at csiro.au" <Gareth.Williams at csiro.au>
>> 
>> wrote:
>>> Hi Lance,
>>> 
>>> Does maui locate appropriate nodes if you specify:
>>> -l procs=24,vmem=29600mb
>>> ?
>>> That's what I'd do.  It will not limit the memory per process (loosely
>>> speaking) but the main problem is which nodes are allocated.
>>> 
>>> Gareth
>>> 
>>>> -----Original Message-----
>>>> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
>>>> Sent: Thursday, 27 October 2011 2:31 AM
>>>> To: mauiusers at supercluster.org
>>>> Subject: [Mauiusers] torque/maui disregarding pmem with procs
>>>> 
>>>> 
>>>> Hello all-
>>>> 
>>>> (I sent this email to the torque list, but I'm wondering if it might be
>>>> a maui problem).
>>>> 
>>>> We are trying to use procs= and pmem= on an 18-node (152-core) cluster
>>>> with nodes of various memory sizes. pbsnodes shows the correct memory
>>>> complement for each node, so apparently PBS is getting the right specs
>>>> (see the output of pbsnodes below for more information). If we use the
>>>> following settings in the PBS script, torque/maui will invariably try
>>>> to fill all 8 cores of each node, even though there is nowhere near
>>>> enough memory on any of these nodes for 8*3700mb=29600mb. Considering
>>>> the physical memory limit ranges from 8GB to 24GB depending upon the
>>>> node, this is just taking down nodes left and right.
>>>> 
>>>> Below I have provided a small example along with the associated output.
>>>> I also provided the output for pbsnodes in case there is something I am
>>>> missing here.
>>>> 
>>>> Thanks for your help!  -Lance
>>>> 
>>>> torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same
>>>> problem.
>>>> maui version: 3.2.6p21 (also tried maui 3.3.1, but the procs option
>>>> fails completely there and the job only requests a single CPU)
>>>> 
>>>> $ cat tmp.pbs
>>>> #!/bin/bash
>>>> #PBS -S /bin/bash
>>>> #PBS -l procs=24
>>>> #PBS -l pmem=3700mb
>>>> #PBS -l walltime=6:00:00
>>>> #PBS -j oe
>>>> 
>>>> cat $PBS_NODEFILE
>>>> 
>>>> $ qsub tmp.pbs
>>>> 337003.XXXX
>>>> $ wc -l tmp.pbs.o337003
>>>> 24 tmp.pbs.o337003
>>>> $ cat tmp.pbs.o337003
>>>> compute-0-14
>>>> compute-0-14
>>>> compute-0-14
>>>> compute-0-14
>>>> compute-0-14
>>>> compute-0-14
>>>> compute-0-14
>>>> compute-0-14
>>>> compute-0-15
>>>> compute-0-15
>>>> compute-0-15
>>>> compute-0-15
>>>> compute-0-15
>>>> compute-0-15
>>>> compute-0-15
>>>> compute-0-15
>>>> compute-0-16
>>>> compute-0-16
>>>> compute-0-16
>>>> compute-0-16
>>>> compute-0-16
>>>> compute-0-16
>>>> compute-0-16
>>>> compute-0-16
>>>> 
>>>> $ pbsnodes -a
>>>> compute-0-16
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,l
>>>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=102255
>>>> 76kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-15
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,lo
>>>> adave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=1022557
>>>> 6kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-14
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,l
>>>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=102255
>>>> 76kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-13
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,l
>>>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=102255
>>>> 76kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-12
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,load
>>>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232
>>>> kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-11
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,load
>>>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232
>>>> kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-9
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,load
>>>> ave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232
>>>> kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-8
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,load
>>>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232
>>>> kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-7
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,load
>>>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232
>>>> kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-6
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,load
>>>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232
>>>> kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-5
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,load
>>>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232
>>>> kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-4
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,load
>>>> ave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176
>>>> kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-3
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   properties = lustre
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,lo
>>>> adave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=267364
>>>> 88kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-2
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   properties = lustre
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,load
>>>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488
>>>> kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-1
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   properties = lustre
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,load
>>>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488
>>>> kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-0
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,load
>>>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488
>>>> kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-10
>>>> 
>>>>   state = free
>>>>   np = 2
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,lo
>>>> adave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=143502
>>>> 32kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux
>>>> compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>>> 
>>>> compute-0-17
>>>> 
>>>>   state = free
>>>>   np = 8
>>>>   properties = testbox
>>>>   ntype = cluster
>>>>   status =
>>>> 
>>>> rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,load
>>>> ave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576k
>>>> b,idletime=5558,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>>>> compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>>>> 2011 x86_64,opsys=linux
>>>> 
>>>>   gpus = 0
>>> 
> 
> -- 
> 
>  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> 	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
>        Roy Dragseth, Team Leader, High Performance Computing
> 	 Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers

 ----------------------
 Steve Crusan
 System Administrator
 Center for Research Computing
 University of Rochester
 https://www.crc.rochester.edu/



