[Mauiusers] torque/maui disregarding pmem with procs

Lance Westerhoff lance at quantumbioinc.com
Wed Oct 26 22:15:47 MDT 2011


Hi Gareth-

The vmem and pvmem options don't seem to make any difference.

Incidentally, since the nodes have different memory complements and I will be distributing multi-processor jobs across nodes, I absolutely need something that allocates memory on a per-processor basis. Here's a more detailed rundown of what is available in the cluster:

$ diagnose -n
diagnosing node table (5120 slots)
Name                    State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features              

compute-0-16             Busy   0:8      489:7985        1:1        8341:9985    1.00  linux [NONE] DEF   8.02 005 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-15             Busy   0:8     1185:7985        1:1        9455:9985    1.00  linux [NONE] DEF   8.03 009 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-14             Busy   0:8     1185:7985        1:1        9453:9985    1.00  linux [NONE] DEF   8.00 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-13             Busy   0:8     1185:7985        1:1        9494:9985    1.00  linux [NONE] DEF   8.00 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-12             Busy   0:8     5213:12013       1:1       13340:14013   1.00  linux [NONE] DEF   8.04 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-11             Busy   0:8     5213:12013       1:1       11338:14013   1.00  linux [NONE] DEF   8.01 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-9              Busy   0:8     5213:12013       1:1       11402:14013   1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-8              Busy   0:8     5213:12013       1:1       11379:14013   1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-7              Busy   0:8     5213:12013       1:1       11391:14013   1.00  linux [NONE] DEF   8.01 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-6              Busy   0:8     5213:12013       1:1       11404:14013   1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-5              Busy   0:8     5213:12013       1:1       11371:14013   1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-4              Busy   0:8     9253:16053       1:1       15451:18053   1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-3              Busy   0:8     7725:24109       1:1       25353:26109   1.00  linux [NONE] DEF   8.00 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [lustre]            
compute-0-2              Busy   0:8     7725:24109       1:1       25374:26109   1.00  linux [NONE] DEF   7.99 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [lustre]            
compute-0-1              Busy   0:8     7725:24109       1:1       25146:26109   1.00  linux [NONE] DEF   7.94 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [lustre]            
compute-0-0              Busy   0:8     7725:24109       1:1       25379:26109   1.00  linux [NONE] DEF   8.00 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]              
compute-0-10             Busy   0:2    10313:12013       1:1       12673:14013   1.00  linux [NONE] DEF   2.62 002 [developer_2:2][lowprio_2:2][b [DEFAULT]                      [NONE]              
compute-0-17             Busy   0:8     1185:7985        1:1        7420:9985    1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [testbox]           
-----                     ---   0:138  92186:248518     18:18     255164:284518

Total Nodes: 18  (Active: 18  Idle: 0  Down: 0)
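
For what it's worth, the per-node arithmetic can be sketched in a few lines of Python. This is only an illustration of the constraint I would like the scheduler to honor, not a description of what maui or torque actually do. The node names, proc counts, and configured memory (in MB, the second figure in the Memory column) are copied from the diagnose output above; the names NODES, PMEM_MB, PROCS_WANTED, procs_that_fit, and cluster_capacity are made up for this sketch, and it ignores memory already used by the OS or by other jobs.

```python
# Sketch only: how many pmem-sized processes each node could hold if the
# scheduler honored per-process memory. This is not maui/torque code.

# (node name, configured procs, configured memory in MB), copied from the
# "Procs" and "Memory" columns of the diagnose -n output above.
NODES = [
    ("compute-0-16", 8, 7985),
    ("compute-0-15", 8, 7985),
    ("compute-0-14", 8, 7985),
    ("compute-0-13", 8, 7985),
    ("compute-0-12", 8, 12013),
    ("compute-0-11", 8, 12013),
    ("compute-0-9", 8, 12013),
    ("compute-0-8", 8, 12013),
    ("compute-0-7", 8, 12013),
    ("compute-0-6", 8, 12013),
    ("compute-0-5", 8, 12013),
    ("compute-0-4", 8, 16053),
    ("compute-0-3", 8, 24109),
    ("compute-0-2", 8, 24109),
    ("compute-0-1", 8, 24109),
    ("compute-0-0", 8, 24109),
    ("compute-0-10", 2, 12013),
    ("compute-0-17", 8, 7985),
]

PMEM_MB = 3700      # per-process request, from "#PBS -l pmem=3700mb"
PROCS_WANTED = 24   # from "#PBS -l procs=24"


def procs_that_fit(slots, mem_mb, pmem_mb=PMEM_MB):
    """Processes a node can hold: limited by its slots and by its memory."""
    return min(slots, mem_mb // pmem_mb)


def cluster_capacity(nodes=NODES, pmem_mb=PMEM_MB):
    """Total processes the whole cluster could hold at this pmem."""
    return sum(procs_that_fit(slots, mem, pmem_mb) for _, slots, mem in nodes)


if __name__ == "__main__":
    for name, slots, mem in NODES:
        print(f"{name:14s} {procs_that_fit(slots, mem)} of {slots} slots usable")
    print("cluster capacity:", cluster_capacity(), "procs; wanted:", PROCS_WANTED)
```

By this count an 8 GB node can take 2 such processes, a 12 GB node 3, the 16 GB node 4, and a 24 GB node 6, for 61 across the whole cluster. So a 24-proc job at 3700mb per process should fit if it were spread out rather than packed 8 per node.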

Thanks!

-Lance

On Oct 26, 2011, at 7:07 PM, <Gareth.Williams at csiro.au> wrote:

> Hi Lance,
> 
> Does maui locate appropriate nodes if you specify:
> -l procs=24,vmem=29600mb
> ?
> That's what I'd do.  It will not limit the memory per process (loosely speaking) but the main problem is which nodes are allocated.
> 
> Gareth
> 
> 
>> -----Original Message-----
>> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
>> Sent: Thursday, 27 October 2011 2:31 AM
>> To: mauiusers at supercluster.org
>> Subject: [Mauiusers] torque/maui disregarding pmem with procs
>> 
>> 
>> Hello all-
>> 
>> (I sent this email to the torque list, but I'm wondering if it might be
>> a maui problem).
>> 
>> We are trying to use procs= and pmem= on an 18-node (152-core) cluster
>> with nodes of various memory size. pbsnodes shows the correct memory
>> complement for each node, so apparently PBS is getting the right specs
>> (see the output of pbsnodes below for more information). If we use the
>> following settings in the PBS script, invariably torque/maui will try
>> to fill up all 8 of the 8 cores of each node. That is even though
>> there is nowhere near enough memory on any of these nodes for
>> 8*3700mb=29600mb. Considering the physical memory limit goes from 8GB
>> to 24GB depending upon the node, this is just taking down nodes left
>> and right.
>> 
>> Below I have provided a small example along with the associated output.
>> I also provided the output for pbsnodes in case there is something I am
>> missing here.
>> 
>> Thanks for your help!  -Lance
>> 
>> torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same
>> problem.
>> maui version: 3.2.6p21 (also tried maui 3.3.1, but it fails completely
>> with the procs option: it only asks for a single CPU)
>> 
>> $ cat tmp.pbs
>> #!/bin/bash
>> #PBS -S /bin/bash
>> #PBS -l procs=24
>> #PBS -l pmem=3700mb
>> #PBS -l walltime=6:00:00
>> #PBS -j oe
>> 
>> cat $PBS_NODEFILE
>> 
>> $ qsub tmp.pbs
>> 337003.XXXX
>> $ wc -l tmp.pbs.o337003
>> 24 tmp.pbs.o337003
>> $ cat tmp.pbs.o337003
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> 
>> $ pbsnodes -a
>> compute-0-16
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,l
>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=102255
>> 76kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-15
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,lo
>> adave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=1022557
>> 6kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-14
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,l
>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=102255
>> 76kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-13
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,l
>> oadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=102255
>> 76kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-12
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232
>> kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-11
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232
>> kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-9
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,load
>> ave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232
>> kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-8
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232
>> kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-7
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232
>> kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-6
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232
>> kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-5
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,load
>> ave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232
>> kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-4
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,load
>> ave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176
>> kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-3
>>   state = free
>>   np = 8
>>   properties = lustre
>>   ntype = cluster
>>   status =
>> rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,lo
>> adave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=267364
>> 88kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-2
>>   state = free
>>   np = 8
>>   properties = lustre
>>   ntype = cluster
>>   status =
>> rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,load
>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488
>> kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-1
>>   state = free
>>   np = 8
>>   properties = lustre
>>   ntype = cluster
>>   status =
>> rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,load
>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488
>> kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-0
>>   state = free
>>   np = 8
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,load
>> ave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488
>> kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-10
>>   state = free
>>   np = 2
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,lo
>> adave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=143502
>> 32kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux
>> compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> compute-0-17
>>   state = free
>>   np = 8
>>   properties = testbox
>>   ntype = cluster
>>   status =
>> rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,load
>> ave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576k
>> b,idletime=5558,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux
>> compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT
>> 2011 x86_64,opsys=linux
>>   gpus = 0
>> 
>> 
> 


