[Mauiusers] torque/maui disregarding pmem with procs
Gareth.Williams at csiro.au
Wed Oct 26 17:07:23 MDT 2011
Hi Lance,
Does maui locate appropriate nodes if you specify:
-l procs=24,vmem=29600mb
?
That's what I'd do. It will not limit the memory per process (loosely speaking), but the main problem is which nodes are allocated.
Gareth
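
For reference, a sketch of the job script from the original message with the suggested resource line swapped in for the separate procs= and pmem= lines. Everything else is copied from the quoted tmp.pbs. It is untested here, and whether maui then picks suitable nodes is exactly the open question:

```shell
#!/bin/bash
#PBS -S /bin/bash
# Suggested change: request total virtual memory alongside procs,
# replacing the separate "-l pmem=3700mb" line.
#PBS -l procs=24,vmem=29600mb
#PBS -l walltime=6:00:00
#PBS -j oe

# List the allocated nodes, one line per assigned core.
cat $PBS_NODEFILE
```

Comparing the node list in the job output against physmem in pbsnodes would show whether the allocation changed.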
> -----Original Message-----
> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
> Sent: Thursday, 27 October 2011 2:31 AM
> To: mauiusers at supercluster.org
> Subject: [Mauiusers] torque/maui disregarding pmem with procs
>
>
> Hello all-
>
> (I sent this email to the torque list, but I'm wondering if it might be
> a maui problem).
>
> We are trying to use procs= and pmem= on an 18-node (152-core) cluster
> with nodes of various memory size. pbsnodes shows the correct memory
> complement for each node, so apparently PBS is getting the right specs
> (see the output of pbsnodes below for more information). If we use the
> following settings in the PBS script, invariably torque/maui will try
> to fill up all 8 of the 8 cores of each node. That is even though
> there is nowhere near enough memory on any of these nodes for
> 8*3700mb=29600mb. Considering the physical memory limit goes from 8GB
> to 24GB depending upon the node, this is just taking down nodes left
> and right.
>
> Below I have provided a small example along with the associated output.
> I also provided the output for pbsnodes in case there is something I am
> missing here.
>
> Thanks for your help! -Lance
>
> torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same
> problem.
> maui version: 3.2.6p21 (also tried maui 3.3.1, but it fails completely on
> the procs option and only asks for a single CPU)
>
> $ cat tmp.pbs
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -l procs=24
> #PBS -l pmem=3700mb
> #PBS -l walltime=6:00:00
> #PBS -j oe
>
> cat $PBS_NODEFILE
>
> $ qsub tmp.pbs
> 337003.XXXX
> $ wc -l tmp.pbs.o337003
> 24 tmp.pbs.o337003
> $ cat tmp.pbs.o337003
> compute-0-14
> compute-0-14
> compute-0-14
> compute-0-14
> compute-0-14
> compute-0-14
> compute-0-14
> compute-0-14
> compute-0-15
> compute-0-15
> compute-0-15
> compute-0-15
> compute-0-15
> compute-0-15
> compute-0-15
> compute-0-15
> compute-0-16
> compute-0-16
> compute-0-16
> compute-0-16
> compute-0-16
> compute-0-16
> compute-0-16
> compute-0-16
>
> $ pbsnodes -a
> compute-0-16
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=10225576kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-15
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=10225576kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-14
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=10225576kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-13
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=10225576kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-12
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-11
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-9
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,loadave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-8
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-7
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-6
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-5
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-4
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,loadave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-3
> state = free
> np = 8
> properties = lustre
> ntype = cluster
> status = rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=26736488kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-2
> state = free
> np = 8
> properties = lustre
> ntype = cluster
> status = rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-1
> state = free
> np = 8
> properties = lustre
> ntype = cluster
> status = rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-0
> state = free
> np = 8
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-10
> state = free
> np = 2
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,loadave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=14350232kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
> compute-0-17
> state = free
> np = 8
> properties = testbox
> ntype = cluster
> status = rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576kb,idletime=5558,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> gpus = 0
>
>
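
The 8*3700mb=29600mb arithmetic in the original message can be turned around to ask how many 3700mb processes each node type could hold. A rough sketch, assuming 1 mb = 1024 kb (which I believe is the Torque convention) and treating physmem as the only constraint. The function name procs_that_fit is made up for illustration, and this is not a claim about how maui itself does the calculation:

```shell
#!/bin/bash
# procs_that_fit PHYSMEM_KB PMEM_MB NP
# Prints how many processes of PMEM_MB each fit into PHYSMEM_KB,
# capped at NP (the number of cores the node advertises).
# Assumption: 1 mb = 1024 kb.
procs_that_fit() {
    fit=$(( $1 / ($2 * 1024) ))
    # Never report more slots than the node has cores.
    if [ "$fit" -gt "$3" ]; then
        fit=$3
    fi
    echo "$fit"
}

# The four distinct physmem values (in kb) from the pbsnodes output above.
for physmem_kb in 8177300 12301956 16438900 24688212; do
    n=$(procs_that_fit "$physmem_kb" 3700 8)
    echo "physmem=${physmem_kb}kb: $n procs at pmem=3700mb"
done
```

With the reported physmem values this comes out to 2, 3, 4 and 6 processes for the four node sizes, so no node in the list can host 8 processes at pmem=3700mb.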