[Mauiusers] Strange queue/scheduler issue

Ian Miller ianm at uchicago.edu
Mon Nov 7 15:44:47 MST 2011


Not sure if this is the correct forum for this but
We have  320 core Grid with Maui & torque running.  Three queues are setup
up with two nodes (24 core) for one of them and another with two
exclusively and three nodes sharing the default queue.
When someone submit say 4000 jobs to the default queue.  No one can submit
any jobs to either of the other queue.  They just sit in Q status.   This
started about three days ago and the users and total in an uproar about
it.  

Any thought would on where to find the bottle neck of a config setting
would be helpful. 

-I 



Ian Miller
System Administrator
ianm at uchicago.edu
312-282-6507






On 10/26/11 6:07 PM, "Gareth.Williams at csiro.au" <Gareth.Williams at csiro.au>
wrote:

>Hi Lance,
>
>Does maui locate appropriate nodes if you specify:
>-l procs=24,vmem=29600mb
>?
>That's what I'd do.  It will not limit the memory per process (loosely
>speaking) but the main problem is which nodes are allocated.
>
>Gareth
>
>
>> -----Original Message-----
>> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
>> Sent: Thursday, 27 October 2011 2:31 AM
>> To: mauiusers at supercluster.org
>> Subject: [Mauiusers] torque/maui disregarding pmem with procs
>>
>>
>> Hello all-
>>
>> (I sent this email to the torque list, but I'm wondering if it might be
>> a maui problem).
>>
>> We are trying to use procs= and pmem= on an 18 node (152core) cluster
>> with nodes of various memory size. pbsnodes shows the correct memory
>> complement for each node, so apparently PBS is getting the right specs
>> (see the output of pbsnodes below for more information). If we use the
>> following settings in the PBS script, invariably torque/maui will try
>> to fill up the all 8 of the 8 cores of each node. That is even though
>> there is nowhere near enough memory on any of these nodes for
>> 8*3700mb=29600mb. Considering the physical memory limit goes from 8GB
>> to 24GB depending upon the node, this is just taking down nodes left
>> and right.
>>
>> Below I have provided a small example along with the associated output.
>> I also provided the output for pbsnodes in case there is something I am
>> missing here.
>>
>> Thanks for your help!  -Lance
>>
>> torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same
>> problem.
>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail
>> in terms of the procs option and it only asks for a single CPU)
>>
>> $ cat tmp.pbs
>> #!/bin/bash
>> #PBS -S /bin/bash
>> #PBS -l procs=24
>> #PBS -l pmem=3700mb
>> #PBS -l walltime=6:00:00
>> #PBS -j oe
>>
>> cat $PBS_NODEFILE
>>
>> $ qsub tmp.pbs
>> 337003.XXXX
>> $ wc -l tmp.pbs.o337003
>> 24 tmp.pbs.o337003
>> $ cat tmp.pbs.o337003
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>>
>> $ pbsnodes -a
>> compute-0-16
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,l
>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=102255
>> 76kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-15
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,lo
>> adave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=1022557
>> 6kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-14
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,l
>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=102255
>> 76kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-13
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,l
>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=102255
>> 76kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-12
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232
>> kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-11
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232
>> kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-9
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,load
>> ave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232
>> kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-8
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232
>> kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-7
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232
>> kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-6
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232
>> kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-5
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232
>> kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-4
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,load
>> ave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176
>> kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-3
>>    state = free
>>    np = 8
>>    properties = lustre
>>    ntype = cluster
>>    status =
>> rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,lo
>> adave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=267364
>> 88kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-2
>>    state = free
>>    np = 8
>>    properties = lustre
>>    ntype = cluster
>>    status =
>> rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,load
>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488
>> kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-1
>>    state = free
>>    np = 8
>>    properties = lustre
>>    ntype = cluster
>>    status =
>> rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,load
>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488
>> kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-0
>>    state = free
>>    np = 8
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,load
>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488
>> kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-10
>>    state = free
>>    np = 2
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,lo
>> adave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=143502
>> 32kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux
>> compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>> compute-0-17
>>    state = free
>>    np = 8
>>    properties = testbox
>>    ntype = cluster
>>    status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,load
>> ave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576k
>> b,idletime=5558,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>    gpus = 0
>>
>>
>
>_______________________________________________
>mauiusers mailing list
>mauiusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/mauiusers



More information about the mauiusers mailing list