From ianm at uchicago.edu Mon Nov 7 15:44:47 2011
From: ianm at uchicago.edu (Ian Miller)
Date: Mon, 7 Nov 2011 16:44:47 -0600
Subject: [Mauiusers] Strange queue/scheduler issue
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102BE58CAE4@exvic-mbx04.nexus.csiro.au>
Message-ID:

Not sure if this is the correct forum for this but
We have 320 core Grid with Maui & torque running. Three queues are set up:
one has two nodes (24 cores), another has two nodes exclusively, and three
nodes share the default queue.
When someone submits, say, 4000 jobs to the default queue, no one can submit
any jobs to either of the other queues. They just sit in Q status. This
started about three days ago, and the users are in a total uproar about it.

Any thoughts on where to find the bottleneck or the config setting
responsible would be helpful.

-I

Ian Miller
System Administrator
ianm at uchicago.edu
312-282-6507

On 10/26/11 6:07 PM, "Gareth.Williams at csiro.au" wrote:

>Hi Lance,
>
>Does maui locate appropriate nodes if you specify:
>-l procs=24,vmem=29600mb
>?
>That's what I'd do. It will not limit the memory per process (loosely
>speaking) but the main problem is which nodes are allocated.
>
>Gareth
>
>
>> -----Original Message-----
>> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
>> Sent: Thursday, 27 October 2011 2:31 AM
>> To: mauiusers at supercluster.org
>> Subject: [Mauiusers] torque/maui disregarding pmem with procs
>>
>>
>> Hello all-
>>
>> (I sent this email to the torque list, but I'm wondering if it might be
>> a maui problem).
>>
>> We are trying to use procs= and pmem= on an 18 node (152core) cluster
>> with nodes of various memory size. pbsnodes shows the correct memory
>> complement for each node, so apparently PBS is getting the right specs
>> (see the output of pbsnodes below for more information). If we use the
>> following settings in the PBS script, invariably torque/maui will try
>> to fill up all 8 of the 8 cores of each node.
>> That is even though
>> there is nowhere near enough memory on any of these nodes for
>> 8*3700mb=29600mb. Considering the physical memory limit goes from 8GB
>> to 24GB depending upon the node, this is just taking down nodes left
>> and right.
>>
>> Below I have provided a small example along with the associated output.
>> I also provided the output for pbsnodes in case there is something I am
>> missing here.
>>
>> Thanks for your help! -Lance
>>
>> torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same
>> problem.
>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail
>> in terms of the procs option and it only asks for a single CPU)
>>
>> $ cat tmp.pbs
>> #!/bin/bash
>> #PBS -S /bin/bash
>> #PBS -l procs=24
>> #PBS -l pmem=3700mb
>> #PBS -l walltime=6:00:00
>> #PBS -j oe
>>
>> cat $PBS_NODEFILE
>>
>> $ qsub tmp.pbs
>> 337003.XXXX
>> $ wc -l tmp.pbs.o337003
>> 24 tmp.pbs.o337003
>> $ cat tmp.pbs.o337003
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-14
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-15
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>> compute-0-16
>>
>> $ pbsnodes -a
>> compute-0-16
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=10225576kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-15
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=10225576kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-14
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=10225576kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-13
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=10225576kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-12
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-11
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-9
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,loadave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-8
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-7
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-6
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-5
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-4
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,loadave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-3
>>      state = free
>>      np = 8
>>      properties = lustre
>>      ntype = cluster
>>      status = rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=26736488kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-2
>>      state = free
>>      np = 8
>>      properties = lustre
>>      ntype = cluster
>>      status = rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-1
>>      state = free
>>      np = 8
>>      properties = lustre
>>      ntype = cluster
>>      status = rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-0
>>      state = free
>>      np = 8
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-10
>>      state = free
>>      np = 2
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,loadave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=14350232kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>> compute-0-17
>>      state = free
>>      np = 8
>>      properties = testbox
>>      ntype = cluster
>>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576kb,idletime=5558,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
>>      gpus = 0
>>
>>
>
>_______________________________________________
>mauiusers mailing list
>mauiusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/mauiusers

From roy.dragseth at cc.uit.no Mon Nov 7 15:56:10 2011
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Mon, 7 Nov 2011 23:56:10 +0100
Subject: [Mauiusers] Strange queue/scheduler issue
In-Reply-To:
References:
Message-ID: <201111072356.10182.roy.dragseth@cc.uit.no>

Have you tried to set a limit on the number of idle jobs allowed per
user?

For instance

USERCFG[DEFAULT] MAXIJOB=10

r.


On Monday 7. November 2011 23.44.47 Ian Miller wrote:
> Not sure if this is the correct forum for this but
> We have 320 core Grid with Maui & torque running. Three queues are set up:
> one has two nodes (24 cores), another has two nodes exclusively, and three
> nodes share the default queue.
> When someone submits, say, 4000 jobs to the default queue, no one can submit
> any jobs to either of the other queues. They just sit in Q status.
> This started about three days ago, and the users are in a total uproar
> about it.
>
> Any thoughts on where to find the bottleneck or the config setting
> responsible would be helpful.
>
> -I
>
> Ian Miller
> System Administrator
> ianm at uchicago.edu
> 312-282-6507

--

The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no

From scrusan at ur.rochester.edu Mon Nov 7 16:02:04 2011
From: scrusan at ur.rochester.edu (Steve Crusan)
Date: Mon, 7 Nov 2011 18:02:04 -0500
Subject: [Mauiusers] Strange queue/scheduler issue
In-Reply-To: <201111072356.10182.roy.dragseth@cc.uit.no>
References: <201111072356.10182.roy.dragseth@cc.uit.no>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On Nov 7, 2011, at 5:56 PM, Roy Dragseth wrote:

> Have you tried to set a limit on the number of idle jobs allowed per
> user?
>
> For instance
>
> USERCFG[DEFAULT] MAXIJOB=10
>

The above is correct. You can also cap the number of queued jobs in
TORQUE; I think the attribute is max_user_queuable.

I've run into a situation where a user submitted too many jobs for the
scheduler to handle (jobs wouldn't even be given a state), and I had to
set a max-jobs limit in Maui.

Try Roy's fix above first, though. The TORQUE limit is great as a hard,
easy-to-see user limit, i.e. users get feedback when they break the rules.

> r.
>
>
> On Monday 7. November 2011 23.44.47 Ian Miller wrote:
>> Not sure if this is the correct forum for this but
>> We have 320 core Grid with Maui & torque running.
>> Three queues are set up: one has two nodes (24 cores), another has two
>> nodes exclusively, and three nodes share the default queue.
>> When someone submits, say, 4000 jobs to the default queue, no one can
>> submit any jobs to either of the other queues. They just sit in Q status.
>> This started about three days ago, and the users are in a total uproar
>> about it.
>>
>> Any thoughts on where to find the bottleneck or the config setting
>> responsible would be helpful.
>>
>> -I
>>
>> Ian Miller
>> System Administrator
>> ianm at uchicago.edu
>> 312-282-6507

----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/


-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJOuGN0AAoJENS19LGOpgqKwBAIAJ6vEibED4jiVxlCteZq2Jx2
enqAHhYMlFMgt/k+UKUWcsnwNkNUA0Z7SvsSipuavFLjdUphnr9dBIuwAJWir8NO
3OnDCzEkBe4m4i8a65pK3kurnwqAgqGU7i17+5RVw27yOfmmAU4IWBWE1N3Ql27m
4RLcA13BeBQcrPh4ig6lDTv9KL4xWO/g245rqNIgAmkb2jGAUzAKGcgL2qcDhxZg
1n8MSD0qbQ1n15Co9ETiZ3lEy+Ni/Nbi+3SBNm2yYmdpWWEYLABN3MZlPigzPOxA
acqnn1MyHJsuypA7c9/L+H9fB1lGZ1XGCiZKKBUzVajTYfTtvbeJXMhlZxg+9sM=
=Xkv5
-----END PGP SIGNATURE-----

From jpeltier at sfu.ca Mon Nov 7 18:26:16 2011
From: jpeltier at sfu.ca (James A. Peltier)
Date: Mon, 7 Nov 2011 17:26:16 -0800 (PST)
Subject: [Mauiusers] Strange queue/scheduler issue
In-Reply-To:
Message-ID: <743305110.721781.1320715576065.JavaMail.root@jaguar10.sfu.ca>

----- Original Message -----
| Not sure if this is the correct forum for this but
| We have 320 core Grid with Maui & torque running.
Three queues are
| setup up with two nodes (24 core) for one of them and another with two
| exclusively and three nodes sharing the default queue.
| When someone submit say 4000 jobs to the default queue. No one can submit
| any jobs to either of the other queue. They just sit in Q status. This
| started about three days ago and the users and total in an uproar about
| it.
|
| Any thought would on where to find the bottle neck of a config setting
| would be helpful.
|
| -I

I think you are looking for this:

http://www.adaptivecomputing.com/resources/docs/maui/a.ddevelopment.php

Specifically:

Value  : MMAX_JOB
File   : moab.h
Default: 4096

Maximum total number of simultaneous idle/active jobs allowed.
NOTE: on some releases of Maui, MAX_MJOB may also need to be set and synchronized with MMAX_JOB.

You need to recompile Maui for it to be able to evaluate more than 4096 jobs. This needs to be tweaked on larger clusters. Change it to something like 32768 if you have a really large cluster. Keep in mind that this slows the scheduler's job eligibility evaluations because of the increased job count.

--
James A. Peltier
IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone : 778-782-6573
Fax : 778-782-3045
E-Mail : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
http://blogs.sfu.ca/people/jpeltier

I will do the best I can with the talent I have

From raub at uni-duesseldorf.de Tue Nov 8 08:40:45 2011
From: raub at uni-duesseldorf.de (Dr. Stephan Raub)
Date: Tue, 08 Nov 2011 16:40:45 +0100
Subject: [Mauiusers] Possible Memory Corruption in maui
Message-ID: <002601cc9e2c$c7a3c450$56eb4cf0$@de>

Dear fellow maui users,

we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5 (2.6.8-194.26.1.el1) on a cluster with roughly 600 cores.

We experienced a sudden death of the maui scheduler with no message in the logs.
We could not figure out a reason so we attached an "strace" to the maui process (as long as it was "still alive") and we got: select(1024, [9], NULL, NULL, {0, 10000}) = 0 (Timeout) accept(6, 0x7fff9bda9210, [11230449699255222288]) = -1 EAGAIN (Resource temporarily unavailable) select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) stat("/var/log/maui.log", {st_mode=S_IFREG|0640, st_size=9598210, ...}) = 0 write(3, "11/08 16:20:49 MPBSClusterQuery("..., 45) = 45 write(8, "+2+12+58+4root+0+0+0", 20) = 20 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=8, revents=POLLIN}]) fcntl(8, F_GETFL) = 0x2 (flags O_RDWR) read(8, "+2+1+0+0+63+115+3+6gauss1+32+11+"..., 262144) = 62514 open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device or address) writev(2, [{"*** glibc detected *** ", 23}, {"/usr/local/maui/sbin/maui", 25}, {": ", 2}, {"malloc(): memory corruption", 27}, {": 0x", 4}, {"0000000012bbff80", 16}, {" ***\n", 5}], 7) = 102 open("/usr/local/torque/lib/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("tls/x86_64/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("tls/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("x86_64/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/local/torque/lib/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/local/maui/lib/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 10 fstat(10, {st_mode=S_IFREG|0644, st_size=101081, ...}) = 0 mmap(NULL, 101081, PROT_READ, MAP_PRIVATE, 10, 0) = 0x2ae371556000 close(10) = 0 open("/lib64/libgcc_s.so.1", O_RDONLY) = 10 read(10, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\36\240\0033\0\0\0"..., 832) = 832 mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2ae37156f000 munmap(0x2ae37156f000, 
44634112) = 0 munmap(0x2ae378000000, 22474752) = 0 mprotect(0x2ae374000000, 135168, PROT_READ|PROT_WRITE) = 0 fstat(10, {st_mode=S_IFREG|0755, st_size=58400, ...}) = 0 mmap(0x3303a00000, 2151784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 10, 0) = 0x3303a00000 mprotect(0x3303a0d000, 2097152, PROT_NONE) = 0 mmap(0x3303c0d000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 10, 0xd000) = 0x3303c0d000 close(10) = 0 munmap(0x2ae371556000, 101081) = 0 write(2, "======= Backtrace: =========\n", 29) = 29 writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"3300672fae", 10}, {"]\n", 2}], 4) = 31 writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_malloc", 13}, {"+0x", 3}, {"6e", 2}, {")", 1}, {"[0x", 3}, {"3300674cde", 10}, {"]\n", 2}], 9) = 51 writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1}, {"disrst", 6}, {"+0x", 3}, {"fd", 2}, {")", 1}, {"[0x", 3}, {"2ae3709c987d", 12}, {"]\n", 2}], 9) = 66 writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1}, {"decode_DIS_attrl", 16}, {"+0x", 3}, {"c4", 2}, {")", 1}, {"[0x", 3}, {"2ae3709caa64", 12}, {"]\n", 2}], 9) = 76 writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1}, {"decode_DIS_replyCmd", 19}, {"+0x", 3}, {"23d", 3}, {")", 1}, {"[0x", 3}, {"2ae3709cb8bd", 12}, {"]\n", 2}], 9) = 80 writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1}, {"PBSD_rdrpy", 10}, {"+0x", 3}, {"80", 2}, {")", 1}, {"[0x", 3}, {"2ae3709cf6d0", 12}, {"]\n", 2}], 9) = 70 writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1}, {"PBSD_status_get", 15}, {"+0x", 3}, {"26", 2}, {")", 1}, {"[0x", 3}, {"2ae3709d0786", 12}, {"]\n", 2}], 9) = 75 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"49f501", 6}, {"]\n", 2}], 4) = 36 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"45c399", 6}, {"]\n", 2}], 4) = 36 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"45e8c0", 6}, {"]\n", 2}], 4) = 36 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, 
{"493054", 6}, {"]\n", 2}], 4) = 36 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"49683d", 6}, {"]\n", 2}], 4) = 36 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"42a266", 6}, {"]\n", 2}], 4) = 36 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"40591d", 6}, {"]\n", 2}], 4) = 36 writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17}, {"+0x", 3}, {"f4", 2}, {")", 1}, {"[0x", 3}, {"330061d994", 10}, {"]\n", 2}], 9) = 55 writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"402c39", 6}, {"]\n", 2}], 4) = 36 write(2, "======= Memory map: ========\n", 29) = 29 open("/proc/self/maps", O_RDONLY) = 10 read(10, "00400000-004fc000 r-xp 00000000 "..., 1024) = 1024 write(2, "00400000-004fc000 r-xp 00000000 "..., 1024) = 1024 read(10, "0 r-xp 00000000 08:03 18186457 "..., 1024) = 1024 write(2, "0 r-xp 00000000 08:03 18186457 "..., 1024) = 1024 read(10, "_s-4.1.2-20080825.so.1\n3304a0000"..., 1024) = 1024 write(2, "_s-4.1.2-20080825.so.1\n3304a0000"..., 1024) = 1024 read(10, "3 18186474 "..., 1024) = 1024 write(2, "3 18186474 "..., 1024) = 1024 read(10, "ae370be5000-2ae370be7000 rw-p 00"..., 1024) = 1024 write(2, "ae370be5000-2ae370be7000 rw-p 00"..., 1024) = 1024 read(10, "ff9bb04000-7fff9bdb4000 rw-p 7ff"..., 1024) = 159 write(2, "ff9bb04000-7fff9bdb4000 rw-p 7ff"..., 159) = 159 read(10, "", 1024) = 0 close(10) = 0 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0 tgkill(17084, 17084, SIGABRT) = 0 --- SIGABRT (Aborted) @ 0 (0) --- Process 17084 detached The only error message we see is a "memory corruption" after a MPBSClusterQuery() call. Is this a known problem? How can we fix this? Thank you in advance. Best regards. -- --------------------------------------------------------- | | Dr. rer. nat. Stephan Raub | | Dipl. Chem. | | High-Performance-Computing | | Zentrum f?r Informations- und Medientechnologie | | Heinrich-Heine-Universit?t D?sseldorf | | Universit?tsstr. 
1 / Raum 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
| |
| | Tel: +49-211-811-3911
| | Fax: +49-211-811-2539
---------------------------------------------------------

Important Note: This e-mail may contain trade secrets or privileged, undisclosed or otherwise confidential information. If you have received this e-mail in error, you are hereby notified that any review, copying or distribution of it is strictly prohibited. Please inform us immediately and destroy the original transmittal. Thank you for your cooperation.

From akohlmey at cmm.chem.upenn.edu Tue Nov 8 09:10:33 2011
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Tue, 8 Nov 2011 11:10:33 -0500
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <002601cc9e2c$c7a3c450$56eb4cf0$@de>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de>
Message-ID:

2011/11/8 Dr. Stephan Raub :
> Dear fellow maui users,
>
> we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5
> (2.6.8-194.26.1.el1) on a 600-somewhat core cluster.
>
> We experienced a sudden death of the maui scheduler with no message in the
> logs. We could not figure out a reason so we attached an "strace" to the
> maui process (as long as it was "still alive") and we got:
>
[...
> The only error message we see is a "memory corruption" after a
> MPBSClusterQuery() call.
>
> Is this a known problem? How can we fix this?

this is a stab in the dark, but one known problem is that some code segments in maui violate the aliasing requirements of C, and thus maui should be compiled with -fno-strict-aliasing.
for gcc 3.x this was the default, but gcc 4.x switched to -fstrict-aliasing by default.

cheers,
axel.

>
> Thank you in advance.
> Best regards.
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
>

--
Dr. Axel Kohlmeyer
akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From raub at uni-duesseldorf.de Tue Nov 8 10:09:47 2011
From: raub at uni-duesseldorf.de (Dr.
Stephan Raub)
Date: Tue, 08 Nov 2011 18:09:47 +0100
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <4EB95446.7000908@sara.nl>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl>
Message-ID: <003301cc9e39$387f40d0$a97dc270$@de>

Dear Mr. van der Vlies,

Currently we have 6095 jobs queued and 93 jobs running. Among these, we have some large job arrays (1000 and 4000 items per array).

Best regards.
--
---------------------------------------------------------
| | Dr. rer. nat. Stephan Raub
| | Dipl. Chem.
| | High-Performance-Computing
| | Zentrum für Informations- und Medientechnologie
| | Heinrich-Heine-Universität Düsseldorf
| | Universitätsstr. 1 / Raum 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
| |
| | Tel: +49-211-811-3911
| | Fax: +49-211-811-2539
---------------------------------------------------------

> -----Original Message-----
> From: Bas van der Vlies [mailto:basv at sara.nl]
> Sent: Tuesday, 8 November 2011 17:10
> To: Dr. Stephan Raub
> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>
> On 08-11-11 16:40, Dr.
Stephan Raub wrote:
> > Dear fellow maui users,
> >
> > we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5
> > (2.6.8-194.26.1.el1) on a 600-somewhat core cluster.
> >
> > We experienced a sudden death of the maui scheduler with no message in the
> > logs. We could not figure out a reason so we attached an "strace" to the
> > maui process (as long as it was "still alive") and we got:
> >
>
> Dear Dr. Stephan Raub,
>
> just a question: How many jobs are in the queue?
>
> regards
>
>
> --
> ********************************************************************
> * Bas van der Vlies e-mail: basv at sara.nl *
> * SARA - Academic Computing Services Amsterdam, The Netherlands *
> ********************************************************************

From fcaba at uns.edu.ar Tue Nov 8 14:58:54 2011
From: fcaba at uns.edu.ar (Fernando Caba)
Date: Tue, 08 Nov 2011 18:58:54 -0300
Subject: [Mauiusers] jobs assigned to same cores in the same node?
Message-ID: <4EB9A61E.1080104@uns.edu.ar>

Hi mauiusers,

I have a job assigned to node10, cores 0 to 3, and another job assigned to the same node and the very same cores (0 to 3).
Does anybody have an idea what is happening? I have torque-3.0.1 and maui-3.3.1.

Thanks

--
----------------------------------------------------
Ing. Fernando Caba
Director General de Telecomunicaciones
Universidad Nacional del Sur
http://www.dgt.uns.edu.ar
Tel/Fax: (54)-291-4595166
Tel: (54)-291-4595101 int. 2050
Avda.
Alem 1253, (B8000CPB) Bah?a Blanca - Argentina ---------------------------------------------------- From jasonw at Jhu.edu Tue Nov 8 15:50:27 2011 From: jasonw at Jhu.edu (Jason Williams) Date: Tue, 08 Nov 2011 17:50:27 -0500 Subject: [Mauiusers] Possible Memory Corruption in maui In-Reply-To: <003301cc9e39$387f40d0$a97dc270$@de> References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> Message-ID: <4EB9B233.9030307@Jhu.edu> Dr Stephan Raub, Maui does have some very odd "memory management" in it that has a tendency to cause these types of crashes when run in high volume situations without some tweaks and/or concessions. I've tracked down, and I think fixed, one in the latest svn trunk, but 3.3.1 should already have that fix in it. Can/have you tried running maui from the command line with the -d line and catching the corrupt memory and back trace that comes out of it? Your original email has the strace, but it cuts off some of the backtrace. I might be able to see where in the code it's having problems, if I can get the full back trace. -- Jason Williams Systems Engineer Homewood High Performance Cluster Johns Hopkins University On 11/8/2011 12:09 PM, Dr. Stephan Raub wrote: > Dear Mr. van der Vlies > > Currently we have 6095 Jobs queued and 93 Jobs running. Amoung these, we > have some large job arrays (1000 and 4000 items per array). > > Best regards. > -- > --------------------------------------------------------- > | | Dr. rer. nat. Stephan Raub > | | Dipl. Chem. > | | High-Performance-Computing > | | Zentrum f?r Informations- und Medientechnologie > | | Heinrich-Heine-Universit?t D?sseldorf > | | Universit?tsstr. 1 / Raum 25.41.O2.25-2 > | | 40225 D?sseldorf / Germany > | | > | | Tel: +49-211-811-3911 > | | Fax: +49-211-811-2539 > --------------------------------------------------------- > > Wichtiger Hinweis: Diese E-Mail kann Betriebs- oder Gesch?ftsgeheimnisse, > bzw. 
> sonstige vertrauliche Informationen enthalten. Sollten Sie diese E-Mail > irrt?mlich erhalten haben, ist Ihnen eine Kenntnisnahme des Inhalts, eine > Vervielf?ltigung oder Weitergabe der E-Mail ausdr?cklich untersagt. Bitte > benachrichtigen Sie uns und vernichten Sie die empfangene E-Mail. Vielen > Dank. > > Important Note: This e-mail may contain trade secrets or privileged, > undisclosed or otherwise confidential information. If you have received this > e-mail in error, you are hereby notified that any review, copying or > distribution of it is strictly prohibited. Please inform us immediately and > destroy the original transmittal. Thank you for your cooperation. > > >> -----Urspr?ngliche Nachricht----- >> Von: Bas van der Vlies [mailto:basv at sara.nl] >> Gesendet: Dienstag, 8. November 2011 17:10 >> An: Dr. Stephan Raub >> Betreff: Re: [Mauiusers] Possible Memory Corruption in maui >> >> On 08-11-11 16:40, Dr. Stephan Raub wrote: >>> Dear fellow maui users, >>> >>> we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5 >>> (2.6.8-194.26.1.el1) on a 600-somewhat core cluster. >>> >>> We experienced a sudden death of the maui scheduler with no message >> in the >>> logs. We could not figure out a reason so we attached an "strace" to >> the >>> maui process (as long as it was "still alive") and we got: >>> >> Dear Dr. Stephan Raub, >> >> just a question: How many jobs are in the queue? 
>>
>> regards
>>
>>
>> --
>> ********************************************************************
>> * Bas van der Vlies e-mail: basv at sara.nl *
>> * SARA - Academic Computing Services Amsterdam, The Netherlands *
>> ********************************************************************
>
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers

From fcaba at uns.edu.ar Tue Nov 8 15:50:59 2011
From: fcaba at uns.edu.ar (Fernando Caba)
Date: Tue, 08 Nov 2011 19:50:59 -0300
Subject: [Mauiusers] jobs assigned to same cores in the same node?
In-Reply-To:
References: <4EB9A61E.1080104@uns.edu.ar>
Message-ID: <4EB9B253.3090803@uns.edu.ar>

Dear Hector,

thank you for your prompt reply. The problem is that, while several jobs were active on the cluster, a user first launched a calculation as one job and about 5 hours later launched another. Both jobs ended up on the same node and the same cores.
Part of the qstat output:

[root at fe ~]# qstat -f 477
Job Id: 477.fe
    Job_Name = job_gr_PBE
    Job_Owner = matias at fe
    job_state = Q
    queue = batch
    server = fe
    Checkpoint = u
    ctime = Tue Nov 8 11:58:16 2011
    Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e477
    exec_host = n10/3+n10/2+n10/1+n10/0
    exec_port = 15003+15003+15003+15003

[root at fe ~]# qstat -f 480
Job Id: 480.fe
    Job_Name = job_gr_PBE
    Job_Owner = matias at fe
    job_state = Q
    queue = batch
    server = fe
    Checkpoint = u
    ctime = Tue Nov 8 17:26:09 2011
    Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e480
    exec_host = n10/3+n10/2+n10/1+n10/0
    exec_port = 15003+15003+15003+15003

This is what tracejob gives me for both jobs:

[root at fe ~]# tracejob 480
/var/spool/torque/mom_logs/20111108: No such file or directory
/var/spool/torque/sched_logs/20111108: No such file or directory

Job: 480.fe

11/08/2011 17:26:09 S enqueuing into batch, state 1 hop 1
11/08/2011 17:26:09 S Job Queued at request of matias at fe, owner = matias at fe, job name = job_gr_PBE, queue = batch
11/08/2011 17:26:09 A queue=batch
11/08/2011 17:26:10 S Job Run at request of root at fe
11/08/2011 17:26:12 S unable to run job, MOM rejected/rc=2
11/08/2011 18:26:36 S Job Run at request of root at fe
11/08/2011 18:26:38 S unable to run job, MOM rejected/rc=2
11/08/2011 19:26:45 S Job Run at request of root at fe
11/08/2011 19:26:45 S Not sending email: User does not want mail of this type.
11/08/2011 19:26:53 S Exit_status=0 resources_used.cput=00:00:27 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:09 11/08/2011 19:26:53 A user=matias group=matias jobname=job_gr_PBE queue=batch ctime=1320783969 qtime=1320783969 etime=1320783969 start=1320791205 owner=matias at fe exec_host=n11/7+n11/6+n11/5+n11/4 Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1 Resource_List.nodes=1:ppn=4 Resource_List.walltime=2400:00:00 session=8035 end=1320791213 Exit_status=0 resources_used.cput=00:00:27 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:09 11/08/2011 19:31:53 S dequeuing from batch, state COMPLETE [root at fe ~]# [root at fe ~]# tracejob 477 /var/spool/torque/mom_logs/20111108: No such file or directory /var/spool/torque/sched_logs/20111108: No such file or directory Job: 477.fe 11/08/2011 11:58:16 S enqueuing into batch, state 1 hop 1 11/08/2011 11:58:16 S Job Queued at request of matias at fe, owner = matias at fe, job name = job_gr_PBE, queue = batch 11/08/2011 11:58:16 A queue=batch 11/08/2011 11:58:17 S Job Run at request of root at fe 11/08/2011 11:58:19 S unable to run job, MOM rejected/rc=2 11/08/2011 12:58:34 S Job Run at request of root at fe 11/08/2011 12:58:36 S unable to run job, MOM rejected/rc=2 11/08/2011 13:58:37 S Job Run at request of root at fe 11/08/2011 13:58:39 S unable to run job, MOM rejected/rc=2 11/08/2011 14:58:43 S Job Run at request of root at fe 11/08/2011 14:58:45 S unable to run job, MOM rejected/rc=2 11/08/2011 15:59:09 S Job Run at request of root at fe 11/08/2011 15:59:11 S unable to run job, MOM rejected/rc=2 11/08/2011 16:59:30 S Job Run at request of root at fe 11/08/2011 16:59:32 S unable to run job, MOM rejected/rc=2 11/08/2011 17:59:50 S Job Run at request of root at fe 11/08/2011 17:59:52 S unable to run job, MOM rejected/rc=2 11/08/2011 19:00:02 S Job Run at request of root at fe 11/08/2011 19:00:02 S Not sending email: User does not want mail of this type. 
11/08/2011 19:00:02 A user=matias group=matias jobname=job_gr_PBE queue=batch ctime=1320764296 qtime=1320764296 etime=1320764296 start=1320789602 owner=matias at fe exec_host=n11/7+n11/6+n11/5+n11/4 Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1 Resource_List.nodes=1:ppn=4 Resource_List.walltime=2400:00:00
11/08/2011 19:00:10 S Not sending email: User does not want mail of this type.
11/08/2011 19:00:10 S Exit_status=0 resources_used.cput=00:00:27 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:09
11/08/2011 19:00:10 A user=matias group=matias jobname=job_gr_PBE queue=batch ctime=1320764296 qtime=1320764296 etime=1320764296 start=1320789602 owner=matias at fe exec_host=n11/7+n11/6+n11/5+n11/4 Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1 Resource_List.nodes=1:ppn=4 Resource_List.walltime=2400:00:00 session=7936 end=1320789610 Exit_status=0 resources_used.cput=00:00:27 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:09
11/08/2011 19:05:11 S dequeuing from batch, state COMPLETE
[root at fe ~]#

I don't understand why the logs are missing from the directories /var/spool/torque/mom_logs and /var/spool/torque/sched_logs.

Regards
Fernando

----------------------------------------------------
Ing. Fernando Caba
Director General de Telecomunicaciones
Universidad Nacional del Sur
http://www.dgt.uns.edu.ar
Tel/Fax: (54)-291-4595166
Tel: (54)-291-4595101 int. 2050
Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
----------------------------------------------------

On 08/11/2011 07:17 PM, Hector Oliver wrote:
> What is the state of the jobs (tracejob #job)?
> Do both of them show up in qstat?
> Does your configuration allow several jobs at once?
>
> On Tue, Nov 8, 2011 at 3:58 PM, Fernando Caba wrote:
>
> Hi mauiusers, i have a job that it is assigned to node10, from cores 0
> to 3 and another job assigned to the same node and to the same identical
> cores (o to 3)
> Somebody have any idea what is happening? I have torque-3.0.1 and
> maui-3.3.1.
> Thanks
>
> --
> ----------------------------------------------------
> Ing. Fernando Caba
> Director General de Telecomunicaciones
> Universidad Nacional del Sur
> http://www.dgt.uns.edu.ar
> Tel/Fax: (54)-291-4595166
> Tel: (54)-291-4595101 int. 2050
> Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
> ----------------------------------------------------
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
>
>

From j.blank at fz-juelich.de Tue Nov 8 09:35:19 2011
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Tue, 08 Nov 2011 17:35:19 +0100
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <002601cc9e2c$c7a3c450$56eb4cf0$@de>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de>
Message-ID:

Hello,

> Is this a known problem? How can we fix this?

I'm experiencing similar problems with maui. Our current 'fix' for the problem is to use supervisord to restart the process.

IMHO some of those crashes are correlated with a memory corruption within torque. From time to time it begins to report copies of random regions of its heap in the 'jobs' field of 'pbsnodes -a', which should be the same information maui acquires in the cluster query call (MPBSClusterQuery).

I'm not sure this is the only reason for these crashes. Also, we are using Torque 2.5.8/2.5.9, so this might not be the same problem.
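A rough way to watch for the garbage in the 'jobs' field described above is sketched below. It rests on assumptions that should be checked against your own torque version: that `pbsnodes -a` prints each node as an unindented name followed by indented `key = value` lines, and that a healthy `jobs` value is a comma-separated list of `slot/jobid` entries such as `0/477.fe`. The function name `find_suspect_nodes`, the pattern and the sample text are all hypothetical and not taken from torque or maui.

```python
import re

# Assumption: a well-formed entry looks like "<slot>/<jobid>", e.g. "0/477.fe",
# optionally with an array index as in "3/480[12].fe". Anything else is suspect.
JOB_ENTRY = re.compile(r"\d+/\d+(\[\d*\])?\.[A-Za-z0-9_.-]+")


def find_suspect_nodes(pbsnodes_text):
    """Return {node_name: [bad entries]} for nodes whose jobs field looks corrupt."""
    suspects = {}
    node = None
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new node record.
            node = line.strip()
            continue
        key, sep, value = line.strip().partition(" = ")
        if node is None or not sep or key != "jobs":
            continue
        entries = [entry.strip() for entry in value.split(",")]
        bad = [entry for entry in entries if not JOB_ENTRY.fullmatch(entry)]
        if bad:
            suspects[node] = bad
    return suspects


# Hypothetical sample in the assumed pbsnodes layout; n11 carries one junk entry.
SAMPLE = """\
n10
     state = job-exclusive
     np = 8
     jobs = 0/477.fe, 1/477.fe, 2/477.fe, 3/477.fe

n11
     state = free
     np = 8
     jobs = 0/480.fe, %$#heap-bytes, 2/480.fe
"""

if __name__ == "__main__":
    for name, bad_entries in find_suspect_nodes(SAMPLE).items():
        print(name, bad_entries)
```

Fed with the real output of `pbsnodes -a` from a cron job, a check like this could at least timestamp when the field goes bad, which can then be compared with the times of the maui crashes.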
Regards,
Jörg Blank

From j.blank at fz-juelich.de Wed Nov 9 00:51:32 2011
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Wed, 09 Nov 2011 08:51:32 +0100
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <4EB9B233.9030307@Jhu.edu>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> <4EB9B233.9030307@Jhu.edu>
Message-ID:

Hello,

> Can/have you tried running maui from the command line with the -d line
> and catching the corrupt memory and back trace that comes out of it?

I'm not entirely sure that we are experiencing the same type of crashes, but it certainly looks like it. You can find a month's worth of debug output at

http://www.4geeks.de/files/maui-stderr.log

I noticed that these crashes seem to be correlated with a (dangling pointer?) bug in torque which leads to random heap data being included in the "jobs" field of "pbsnodes -a" (AFAIK this is the same as MPBSClusterQuery). Maui crashes also tend to happen when someone uses array jobs.

Regards,
Joerg Blank

From raub at uni-duesseldorf.de Wed Nov 9 02:38:23 2011
From: raub at uni-duesseldorf.de (Dr. Stephan Raub)
Date: Wed, 09 Nov 2011 10:38:23 +0100
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <4EB9B233.9030307@Jhu.edu>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> <4EB9B233.9030307@Jhu.edu>
Message-ID: <000701cc9ec3$53777210$fa665630$@de>

Dear Jason Williams,

thank you for your hint.
Please find below the output of our Maui running with the "-d" command line option (maui ran for about 5 minutes before it crashed):

# /usr/local/maui/sbin/maui -d
*** glibc detected *** /usr/local/maui/sbin/maui: malloc(): memory corruption: 0x00000000099243e0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3300672fae]
/lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde]
/usr/local/torque/lib/libtorque.so.2(decode_DIS_replyCmd+0x266)[0x2ab278cb18e6]
/usr/local/torque/lib/libtorque.so.2(PBSD_rdrpy+0x80)[0x2ab278cb56d0]
/usr/local/torque/lib/libtorque.so.2(PBSD_status_get+0x26)[0x2ab278cb6786]
/usr/local/maui/sbin/maui[0x4d9e59]
/usr/local/maui/sbin/maui[0x48b8e4]
/usr/local/maui/sbin/maui[0x48b84f]
/usr/local/maui/sbin/maui[0x4ce81c]
/usr/local/maui/sbin/maui[0x4ce39e]
/usr/local/maui/sbin/maui[0x4419eb]
/usr/local/maui/sbin/maui[0x403608]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x330061d994]
/usr/local/maui/sbin/maui[0x402cd9]
======= Memory map: ========
00400000-0054f000 r-xp 00000000 08:03 50266128 /usr/local/maui/sbin/maui
0074f000-00754000 rw-p 0014f000 08:03 50266128 /usr/local/maui/sbin/maui
00754000-02344000 rw-p 00754000 00:00 0
0984b000-188f1000 rw-p 0984b000 00:00 0 [heap]
3300200000-330021c000 r-xp 00000000 08:03 18186265 /lib64/ld-2.5.so
330041b000-330041c000 r--p 0001b000 08:03 18186265 /lib64/ld-2.5.so
330041c000-330041d000 rw-p 0001c000 08:03 18186265 /lib64/ld-2.5.so
3300600000-330074e000 r-xp 00000000 08:03 18186304 /lib64/libc-2.5.so
330074e000-330094d000 ---p 0014e000 08:03 18186304 /lib64/libc-2.5.so
330094d000-3300951000 r--p 0014d000 08:03 18186304 /lib64/libc-2.5.so
3300951000-3300952000 rw-p 00151000 08:03 18186304 /lib64/libc-2.5.so
3300952000-3300957000 rw-p 3300952000 00:00 0
3300a00000-3300a02000 r-xp 00000000 08:03 18186457 /lib64/libdl-2.5.so
3300a02000-3300c02000 ---p 00002000 08:03 18186457 /lib64/libdl-2.5.so
3300c02000-3300c03000 r--p 00002000 08:03 18186457 /lib64/libdl-2.5.so
3300c03000-3300c04000 rw-p 00003000 08:03 18186457 /lib64/libdl-2.5.so
3300e00000-3300e82000 r-xp 00000000 08:03 18186543 /lib64/libm-2.5.so
3300e82000-3301081000 ---p 00082000 08:03 18186543 /lib64/libm-2.5.so
3301081000-3301082000 r--p 00081000 08:03 18186543 /lib64/libm-2.5.so
3301082000-3301083000 rw-p 00082000 08:03 18186543 /lib64/libm-2.5.so
3303a00000-3303a0d000 r-xp 00000000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
3303a0d000-3303c0d000 ---p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
3303c0d000-3303c0e000 rw-p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
3304a00000-3304a15000 r-xp 00000000 08:03 18186491 /lib64/libselinux.so.1
3304a15000-3304c15000 ---p 00015000 08:03 18186491 /lib64/libselinux.so.1
3304c15000-3304c17000 rw-p 00015000 08:03 18186491 /lib64/libselinux.so.1
3304c17000-3304c18000 rw-p 3304c17000 00:00 0
3304e00000-3304e3b000 r-xp 00000000 08:03 18186479 /lib64/libsepol.so.1
3304e3b000-330503b000 ---p 0003b000 08:03 18186479 /lib64/libsepol.so.1
330503b000-330503c000 rw-p 0003b000 08:03 18186479 /lib64/libsepol.so.1
330503c000-3305046000 rw-p 330503c000 00:00 0
3305e00000-3305e02000 r-xp 00000000 08:03 18186469 /lib64/libkeyutils-1.3.so
3305e02000-3306001000 ---p 00002000 08:03 18186469 /lib64/libkeyutils-1.3.so
3306001000-3306002000 rw-p 00001000 08:03 18186469 /lib64/libkeyutils-1.3.so
3306200000-3306211000 r-xp 00000000 08:03 18186474 /lib64/libresolv-2.5.so
3306211000-3306411000 ---p 00011000 08:03 18186474 /lib64/libresolv-2.5.so
3306411000-3306412000 r--p 00Aborted

Thank you for your efforts.

Stephan
--
---------------------------------------------------------
| | Dr. rer. nat. Stephan Raub
| | Dipl. Chem.
| | High-Performance-Computing
| | Zentrum für Informations- und Medientechnologie
| | Heinrich-Heine-Universität Düsseldorf
| | Universitätsstr.
1 / Raum 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
| |
| | Tel: +49-211-811-3911
| | Fax: +49-211-811-2539
---------------------------------------------------------

Important Note: This e-mail may contain trade secrets or privileged, undisclosed or otherwise confidential information. If you have received this e-mail in error, you are hereby notified that any review, copying or distribution of it is strictly prohibited. Please inform us immediately and destroy the original transmittal. Thank you for your cooperation.

> -----Original Message-----
> From: mauiusers-bounces at supercluster.org [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Jason Williams
> Sent: Tuesday, 8 November 2011 23:50
> To: mauiusers at supercluster.org
> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>
> Dr Stephan Raub,
>
> Maui does have some very odd "memory management" in it that has a
> tendency to cause these types of crashes when run in high volume
> situations without some tweaks and/or concessions. I've tracked down,
> and I think fixed, one in the latest svn trunk, but 3.3.1 should
> already have that fix in it.
>
> Can/have you tried running maui from the command line with the -d line
> and catching the corrupt memory and back trace that comes out of it?
> Your original email has the strace, but it cuts off some of the
> backtrace. I might be able to see where in the code it's having
> problems, if I can get the full back trace.
>
> --
> Jason Williams
> Systems Engineer
> Homewood High Performance Cluster
> Johns Hopkins University
>
> On 11/8/2011 12:09 PM, Dr.
Stephan Raub wrote:
> > Dear Mr. van der Vlies
> >
> > Currently we have 6095 Jobs queued and 93 Jobs running. Among these,
> > we have some large job arrays (1000 and 4000 items per array).
> >
> > Best regards.
> >
> >> -----Original Message-----
> >> From: Bas van der Vlies [mailto:basv at sara.nl]
> >> Sent: Tuesday, 8 November 2011 17:10
> >> To: Dr. Stephan Raub
> >> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
> >>
> >> On 08-11-11 16:40, Dr. Stephan Raub wrote:
> >>> Dear fellow maui users,
> >>>
> >>> we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5
> >>> (2.6.8-194.26.1.el1) on a 600-somewhat core cluster.
> >>>
> >>> We experienced a sudden death of the maui scheduler with no message in the
> >>> logs. We could not figure out a reason so we attached an "strace" to the
> >>> maui process (as long as it was "still alive") and we got:
> >>>
> >> Dear Dr. Stephan Raub,
> >>
> >> just a question: How many jobs are in the queue?
> >>
> >> regards
> >>
> >> --
> >> ********************************************************************
> >> * Bas van der Vlies e-mail: basv at sara.nl *
> >> * SARA - Academic Computing Services Amsterdam, The Netherlands *
> >> ********************************************************************

From jasonw at Jhu.edu Wed Nov 9 07:26:19 2011
From: jasonw at Jhu.edu (Jason Williams)
Date: Wed, 09 Nov 2011 09:26:19 -0500
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <000701cc9ec3$53777210$fa665630$@de>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> <4EB9B233.9030307@Jhu.edu> <000701cc9ec3$53777210$fa665630$@de>
Message-ID: <4EBA8D8B.3090609@Jhu.edu>

Dr. Stephan Raub and Joerg Blank,

It looks like the binaries you all have are stripped of their debug symbols, which is going to make my idea of tracing the crash in the maui code next to impossible. However, I'm not entirely convinced this is a maui bug, as the last calls before the libc calls are ones through the torque library.

Joerg: What version of Torque do you have?

I think my next step here would be one or more of the following: a) load any -debuginfo RPM, if you installed via RPM, and run with -d again to hopefully get some debug info in the backtrace; b) run maui -d through gdb and see if you can get some useful information there; c) if you compiled from source, disable stripping of the debug symbols and recompile to get more information in the backtrace. Without some useful information as to where in the maui binary things are when the crash happens, I can't start looking into what happened.

--
Jason

From j.blank at fz-juelich.de Wed Nov 9 07:56:15 2011
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Wed, 09 Nov 2011 15:56:15 +0100
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <4EBA8D8B.3090609@Jhu.edu>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> <4EB9B233.9030307@Jhu.edu> <000701cc9ec3$53777210$fa665630$@de> <4EBA8D8B.3090609@Jhu.edu>
Message-ID:

> Joerg: What version of Torque do you have?

Tested it with 2.5.8 and 2.5.9.

I'll recompile Maui and see what happens.

Regards,
Jörg Blank

From raub at uni-duesseldorf.de Wed Nov 9 12:39:29 2011
From: raub at uni-duesseldorf.de (Dr.
Stephan Raub)
Date: Wed, 09 Nov 2011 20:39:29 +0100
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To: <4EBA8D8B.3090609@Jhu.edu>
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> <4EB9B233.9030307@Jhu.edu> <000701cc9ec3$53777210$fa665630$@de> <4EBA8D8B.3090609@Jhu.edu>
Message-ID: <002201cc9f17$4c53b440$e4fb1cc0$@de>

Hello,

> However, I'm not entirely convinced this is a
> maui bug as the last calls before the libc calls are ones through the
> torque library.

I totally agree. We dived into the code of maui and found out that the error occurs while calling "pbs_statnode()" (MPBSI.c, line 1268). The "memory corruption" seems to be thrown not in maui but in the called torque function PBSD_status_get() (which is called by PBSD_status()) in PBSD_status.c. Currently, we assume an error in building the (struct batch_status) *next entries of this list.

It seems I have to apologize for bothering the maui list with this problem. ;-) Thank you all for your comments and suggestions. They eventually led us in the right direction.

Best regards

Stephan

From jasonw at Jhu.edu Wed Nov 9 13:25:58 2011
From: jasonw at Jhu.edu (Jason Williams)
Date: Wed, 09 Nov 2011
15:25:58 -0500 Subject: [Mauiusers] Possible Memory Corruption in maui In-Reply-To: <002201cc9f17$4c53b440$e4fb1cc0$@de> References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> <4EB9B233.9030307@Jhu.edu> <000701cc9ec3$53777210$fa665630$@de> <4EBA8D8B.3090609@Jhu.edu> <002201cc9f17$4c53b440$e4fb1cc0$@de> Message-ID: <4EBAE1D6.8040409@Jhu.edu> Dr. Raub, If you find a solution to the problem, please let me know and/or post back to the maui list. I don't really monitor the torque lists, they're a bit higher volume. I'll also be curious to know if they try to pass the issue back to the maui side and/or if they don't respond. I'm just glad you are hopefully headed toward a solution now. -- Jason On 11/9/2011 2:39 PM, Dr. Stephan Raub wrote: > Hello, > > >> However, I'm not entirely convinced this is a >> maui bug as the last calls before the libc calls are ones through the >> torque library. > I totally agree. We dived into the code of maui and found out, that the > error occurs while calling "pbs_statnode()" (MPBSI.c, line 1268). The > "memory corruption" seems to be thrown not in maui but in the called > torque-function PBSD_status_get() (which is called by PBSD_status()) in > PBSD_status.c. Currently, we assume an error in building the (struct > batch_status) *next entries of this list. > > It seems, I have to apologize for bothering the maui list with this problem. > ;-) Thank you for all of you for your comments and suggestions. It > eventually has lead us in the right direction. > > Best regards > > Stephan > -- > --------------------------------------------------------- > | | Dr. rer. nat. Stephan Raub > | | Dipl. Chem. > | | High-Performance-Computing > | | Zentrum f?r Informations- und Medientechnologie > | | Heinrich-Heine-Universit?t D?sseldorf > | | Universit?tsstr. 
1 / Raum 25.41.O2.25-2 > | | 40225 D?sseldorf / Germany > | | > | | Tel: +49-211-811-3911 > | | Fax: +49-211-811-2539 > --------------------------------------------------------- > > Wichtiger Hinweis: Diese E-Mail kann Betriebs- oder Gesch?ftsgeheimnisse, > bzw. > sonstige vertrauliche Informationen enthalten. Sollten Sie diese E-Mail > irrt?mlich erhalten haben, ist Ihnen eine Kenntnisnahme des Inhalts, eine > Vervielf?ltigung oder Weitergabe der E-Mail ausdr?cklich untersagt. Bitte > benachrichtigen Sie uns und vernichten Sie die empfangene E-Mail. Vielen > Dank. > > Important Note: This e-mail may contain trade secrets or privileged, > undisclosed or otherwise confidential information. If you have received this > e-mail in error, you are hereby notified that any review, copying or > distribution of it is strictly prohibited. Please inform us immediately and > destroy the original transmittal. Thank you for your cooperation. > > >> -----Urspr?ngliche Nachricht----- >> Von: mauiusers-bounces at supercluster.org [mailto:mauiusers- >> bounces at supercluster.org] Im Auftrag von Jason Williams >> Gesendet: Mittwoch, 9. November 2011 15:26 >> An: mauiusers at supercluster.org >> Betreff: Re: [Mauiusers] Possible Memory Corruption in maui >> >> Dr. Stephan Raub and Joerg Blank, >> >> It looks like the binaries you all have are stripped of their debug >> symbols which is going to make my idea of tracing the crash in the maui >> code next to impossible. However, I'm not entirely convinced this is a >> maui bug as the last calls before the libc calls are ones through the >> torque library. >> >> Joerg: What version of Torque do you have? 
>>
>> I think my next step here would be to either a) load any -debuginfo rpm
>> if you installed it via RPM and try running with the -d again to
>> hopefully get some debug info in the back trace b) try running maui -d
>> through gdb and see if you can get some useful information there and/or
>> c) if you compiled it from source, disable stripping the debug symbols
>> and recompile it to try to get some more information in the backtrace.
>> Without some useful information as to where in the maui binary things
>> are when the crash happens, I can't start looking to see what happened.
>>
>> --
>> Jason
>>
>> On 11/9/2011 4:38 AM, Dr. Stephan Raub wrote:
>>> Dear Jason Williams,
>>>
>>> thank you for your hint. Please find below the result of our Maui
>>> running with the "-d" command line option (maui was running about 5
>>> minutes before it crashed):
>>>
>>> # /usr/local/maui/sbin/maui -d
>>> *** glibc detected *** /usr/local/maui/sbin/maui: malloc(): memory corruption: 0x00000000099243e0 ***
>>> ======= Backtrace: =========
>>> /lib64/libc.so.6[0x3300672fae]
>>> /lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde]
>>> /usr/local/torque/lib/libtorque.so.2(decode_DIS_replyCmd+0x266)[0x2ab278cb18e6]
>>> /usr/local/torque/lib/libtorque.so.2(PBSD_rdrpy+0x80)[0x2ab278cb56d0]
>>> /usr/local/torque/lib/libtorque.so.2(PBSD_status_get+0x26)[0x2ab278cb6786]
>>> /usr/local/maui/sbin/maui[0x4d9e59]
>>> /usr/local/maui/sbin/maui[0x48b8e4]
>>> /usr/local/maui/sbin/maui[0x48b84f]
>>> /usr/local/maui/sbin/maui[0x4ce81c]
>>> /usr/local/maui/sbin/maui[0x4ce39e]
>>> /usr/local/maui/sbin/maui[0x4419eb]
>>> /usr/local/maui/sbin/maui[0x403608]
>>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x330061d994]
>>> /usr/local/maui/sbin/maui[0x402cd9]
>>> ======= Memory map: ========
>>> 00400000-0054f000 r-xp 00000000 08:03 50266128 /usr/local/maui/sbin/maui
>>> 0074f000-00754000 rw-p 0014f000 08:03 50266128 /usr/local/maui/sbin/maui
>>> 00754000-02344000 rw-p 00754000 00:00 0
>>> 0984b000-188f1000 rw-p 0984b000 00:00 0 [heap]
>>> 3300200000-330021c000 r-xp 00000000 08:03 18186265 /lib64/ld-2.5.so
>>> 330041b000-330041c000 r--p 0001b000 08:03 18186265 /lib64/ld-2.5.so
>>> 330041c000-330041d000 rw-p 0001c000 08:03 18186265 /lib64/ld-2.5.so
>>> 3300600000-330074e000 r-xp 00000000 08:03 18186304 /lib64/libc-2.5.so
>>> 330074e000-330094d000 ---p 0014e000 08:03 18186304 /lib64/libc-2.5.so
>>> 330094d000-3300951000 r--p 0014d000 08:03 18186304 /lib64/libc-2.5.so
>>> 3300951000-3300952000 rw-p 00151000 08:03 18186304 /lib64/libc-2.5.so
>>> 3300952000-3300957000 rw-p 3300952000 00:00 0
>>> 3300a00000-3300a02000 r-xp 00000000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300a02000-3300c02000 ---p 00002000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300c02000-3300c03000 r--p 00002000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300c03000-3300c04000 rw-p 00003000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300e00000-3300e82000 r-xp 00000000 08:03 18186543 /lib64/libm-2.5.so
>>> 3300e82000-3301081000 ---p 00082000 08:03 18186543 /lib64/libm-2.5.so
>>> 3301081000-3301082000 r--p 00081000 08:03 18186543 /lib64/libm-2.5.so
>>> 3301082000-3301083000 rw-p 00082000 08:03 18186543 /lib64/libm-2.5.so
>>> 3303a00000-3303a0d000 r-xp 00000000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
>>> 3303a0d000-3303c0d000 ---p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
>>> 3303c0d000-3303c0e000 rw-p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
>>> 3304a00000-3304a15000 r-xp 00000000 08:03 18186491 /lib64/libselinux.so.1
>>> 3304a15000-3304c15000 ---p 00015000 08:03 18186491 /lib64/libselinux.so.1
>>> 3304c15000-3304c17000 rw-p 00015000 08:03 18186491 /lib64/libselinux.so.1
>>> 3304c17000-3304c18000 rw-p 3304c17000 00:00 0
>>> 3304e00000-3304e3b000 r-xp 00000000 08:03 18186479 /lib64/libsepol.so.1
>>> 3304e3b000-330503b000 ---p 0003b000 08:03 18186479 /lib64/libsepol.so.1
>>> 330503b000-330503c000 rw-p 0003b000 08:03 18186479 /lib64/libsepol.so.1
>>> 330503c000-3305046000 rw-p 330503c000 00:00 0
>>> 3305e00000-3305e02000 r-xp 00000000 08:03 18186469 /lib64/libkeyutils-1.3.so
>>> 3305e02000-3306001000 ---p 00002000 08:03 18186469 /lib64/libkeyutils-1.3.so
>>> 3306001000-3306002000 rw-p 00001000 08:03 18186469 /lib64/libkeyutils-1.3.so
>>> 3306200000-3306211000 r-xp 00000000 08:03 18186474 /lib64/libresolv-2.5.so
>>> 3306211000-3306411000 ---p 00011000 08:03 18186474 /lib64/libresolv-2.5.so
>>> 3306411000-3306412000 r--p 00Aborted
>>>
>>> Thank you for your efforts.
>>>
>>> Stephan
>>>
>>>> -----Original Message-----
>>>> From: mauiusers-bounces at supercluster.org [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Jason Williams
>>>> Sent: Tuesday, 8 November 2011 23:50
>>>> To: mauiusers at supercluster.org
>>>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>>>>
>>>> Dr Stephan Raub,
>>>>
>>>> Maui does have some very odd "memory management" in it that has a
>>>> tendency to cause these types of crashes when run in high volume
>>>> situations without some tweaks and/or concessions. I've tracked
>>>> down, and I think fixed, one in the latest svn trunk, but 3.3.1
>>>> should already have that fix in it.
>>>>
>>>> Can/have you tried running maui from the command line with the -d
>>>> line and catching the corrupt memory and back trace that comes out of it?
>>>> Your original email has the strace, but it cuts off some of the
>>>> backtrace. I might be able to see where in the code it's having
>>>> problems, if I can get the full back trace.
>>>>
>>>> --
>>>> Jason Williams
>>>> Systems Engineer
>>>> Homewood High Performance Cluster
>>>> Johns Hopkins University

From jasonw at Jhu.edu Wed Nov 9 13:27:02 2011
From: jasonw at Jhu.edu (Jason Williams)
Date: Wed, 09 Nov 2011 15:27:02 -0500
Subject: [Mauiusers] Possible Memory Corruption in maui
In-Reply-To:
References: <002601cc9e2c$c7a3c450$56eb4cf0$@de> <4EB95446.7000908@sara.nl> <003301cc9e39$387f40d0$a97dc270$@de> <4EB9B233.9030307@Jhu.edu> <000701cc9ec3$53777210$fa665630$@de> <4EBA8D8B.3090609@Jhu.edu>
Message-ID: <4EBAE216.8060101@Jhu.edu>

Let me know if you get a full back trace out of it. I will definitely
help in any way I can.

--
Jason

On 11/9/2011 9:56 AM, Joerg Blank wrote:
>> Joerg: What version of Torque do you have?
> Tested it with 2.5.8 and 2.5.9
> I'll recompile Maui and see what happens.
>
> Regards,
>
> Jörg Blank
>

From lance at quantumbioinc.com Fri Nov 18 07:39:17 2011
From: lance at quantumbioinc.com (Lance Westerhoff)
Date: Fri, 18 Nov 2011 09:39:17 -0500
Subject: [Mauiusers] procs= not working as documented
References: <5BE7BF30-0EAF-4DF0-B4E2-DECA32FEB99C@quantumbioinc.com>
Message-ID: <263D778C-C50E-4681-AE3B-112FE998735F@quantumbioinc.com>

Hello All-

I submitted the following to the torque list, but the more I look at it,
the more I think it might be a scheduler problem. It appears that when
running with the following specs, the procs= option does not actually
work as expected.

==========================================

#PBS -S /bin/bash
#PBS -l procs=60
#PBS -l pmem=700mb
#PBS -l walltime=744:00:00
#PBS -j oe
#PBS -q batch

torque version: tried 3.0.2. In v2.5.4, I think the procs option worked as documented.
maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU)

==========================================

If there are fewer than 60 processors available in the cluster (in this
case there were 53 available), the job will go in and take whatever
processors are remaining instead of waiting for all 60 processors to
free up. Any thoughts as to why this might be happening? Sometimes it
doesn't really matter and 53 would be almost as good as 60; however, if
only 2 processors are available and the user asks for 60, I would hate
for him to go in.

Thank you for your time!

-Lance

From stevenx.a.duchene at intel.com Mon Nov 28 17:04:44 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 28 Nov 2011 16:04:44 -0800
Subject: [Mauiusers] maui segfaults trying to schedule a job
Message-ID:

This morning I discovered that the maui scheduler process was not running on one of our clusters like it should. When I try to start the maui process as the maui user I get a segmentation fault.
In checking the log files the last few entries look like this:

11/28 15:45:24 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
11/28 15:45:24 INFO: job '231' Priority: 605
11/28 15:45:24 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 605(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
11/28 15:45:24 MStatClearUsage([NONE],Active)
11/28 15:45:24 INFO: total jobs selected (ALL): 1/1
11/28 15:45:24 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
11/28 15:45:24 INFO: job '231' Priority: 605
11/28 15:45:24 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 605(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
11/28 15:45:24 MStatClearUsage([NONE],Idle)
11/28 15:45:24 INFO: total jobs selected (ALL): 1/1
11/28 15:45:24 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
11/28 15:45:24 INFO: total jobs selected in partition ALL: 1/1
11/28 15:45:24 MQueueScheduleRJobs(Q)
11/28 15:45:24 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
11/28 15:45:24 INFO: total jobs selected in partition ALL: 1/1
11/28 15:45:24 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
11/28 15:45:24 INFO: total jobs selected in partition DEFAULT: 1/1
11/28 15:45:24 MQueueScheduleIJobs(Q,DEFAULT)
11/28 15:45:24 INFO: 156 feasible tasks found for job 231:0 in partition DEFAULT (39 Needed)
11/28 15:45:24 INFO: 156 feasible tasks found for job 231:1 in partition DEFAULT (39 Needed)
11/28 15:45:24 INFO: 156 feasible tasks found for job 231:2 in partition DEFAULT (39 Needed)
11/28 15:45:24 INFO: 156 feasible tasks found for job 231:3 in partition DEFAULT (39 Needed)
11/28 15:45:24 INFO: 156 feasible tasks found for job 231:4 in partition DEFAULT (16 Needed)

Prior to the above entries there are a WHOLE BUNCH of entries similar to these shown below:

11/28 15:45:24 MUGetIndex(TJC,ValList,0)
11/28 15:45:24 MUGetIndex(TNJA,ValList,0)
11/28 15:45:24 MUGetIndex(TNJC,ValList,0)
11/28 15:45:24 MUGetIndex(TNXF,ValList,0)
11/28 15:45:24 MUGetIndex(TPSD,ValList,0)
11/28 15:45:24 MUGetIndex(TPSE,ValList,0)
11/28 15:45:24 MUGetIndex(TPSR,ValList,0)
11/28 15:45:24 MUGetIndex(TPSU,ValList,0)
11/28 15:45:24 MUGetIndex(TQM,ValList,0)
11/28 15:45:24 MUGetIndex(TQT,ValList,0)
11/28 15:45:24 MUGetIndex(TRT,ValList,0)
11/28 15:45:24 MUGetIndex(TXF,ValList,0)

There is only this one job in the queue on a 256 node cluster running torque 2.5.7 and maui 3.2.6p21

I have tried starting the maui process within strace but I do not see any smoking gun in that strace output.

I can probably get maui to start if I qdel the job but I was sort of hoping to see what was causing the problem in case any additional debugging output was needed.
--
Steven DuChene

From stevenx.a.duchene at intel.com Mon Nov 28 17:28:58 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 28 Nov 2011 16:28:58 -0800
Subject: [Mauiusers] maui segfaults trying to schedule a job
In-Reply-To:
References:
Message-ID:

BTW, I just tried upgrading to maui-3.3.1 and I still have the same issue. Maui segfaults when I try to start the maui process with this one job in the queue.
--
Steven DuChene

From stevenx.a.duchene at intel.com Tue Nov 29 10:06:56 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 29 Nov 2011 09:06:56 -0800
Subject: [Mauiusers] maui segfaults trying to schedule a job
In-Reply-To: <4ED4F87B.9090608@calculquebec.ca>
References: <4ED4F87B.9090608@calculquebec.ca>
Message-ID:

No, the systems in this cluster only have 4 cores per node.
BTW, in case this has some bearing on the problem here are the things I have set in my maui.cfg file:

RMCFG[MYCLUSTER.ORG] TYPE=PBS
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:00:30
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
ENABLEMULTIREQJOBS TRUE
ENABLEMULTINODEJOBS TRUE
JOBNODEMATCHPOLICY EXACTNODE

The job the user is submitting requests the following:

#PBS -l nodes=39:SeaMicro:diskgrp1+39:SeaMicro:diskgrp2+39:SeaMicro:diskgrp3+39:SeaMicro:diskgrp4+16:SeaMicro:diskgrp6

-----Original Message-----
From: Michel Béland [mailto:michel.beland at calculquebec.ca]
Sent: Tuesday, November 29, 2011 7:22 AM
To: DuChene, StevenX A
Cc: mauiusers at supercluster.org
Subject: Re: [Mauiusers] maui segfaults trying to schedule a job

DuChene, StevenX A wrote:
> This morning I discovered that the maui scheduler process was not
> running on one of our clusters like it should. When I try to start the
> maui process as the maui user I get a segmentation fault. In checking
> the log files the last few entries look like this:
>
> (...)
>
> There is only this one job in the queue on a 256 node cluster running
> torque 2.5.7 and maui 3.2.6p21
>
> I have tried starting the maui process within strace but I do not see
> any smoking gun in that strace output.
>
> I can probably get maui to start if I qdel the job but I was sort of
> hoping to see what was causing the problem in case any additional
> debugging output was needed.

I guess that you have more than 16 cores per node so that your job requests more than 4096 cores. In that case, you have to increase MAX_MTASK in include/msched-common.h and recompile. It has to be equal to or greater than the number of cores in the cluster.

You have to watch out also for the size of MAX_MBUFFER and MMAX_BUFFER in include/mcom.h and include/msched-common.h. This is used to define the size of the buffer that contains the string exec_host. For large clusters, it is too small and large jobs will kill Maui after they have started execution. It is good to have short node names for that reason.

Other parameters to check are MAX_MNODE, MAX_MCLASS, MAX_MREQ_PER_JOB (or MMAX_REQ_PER_JOB), MAX_MRES, MAX_MNODE_PER_JOB, MAX_MNODE_PER_FRAG, MMAX_JOB.

--
Michel Béland, analyste en calcul scientifique
michel.beland at calculquebec.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone : 514 343-6111 poste 3892 télécopieur : 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.org)

From stevenx.a.duchene at intel.com Tue Nov 29 10:40:33 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 29 Nov 2011 09:40:33 -0800
Subject: [Mauiusers] maui segfaults trying to schedule a job
In-Reply-To: <4ED4F87B.9090608@calculquebec.ca>
References: <4ED4F87B.9090608@calculquebec.ca>
Message-ID:

OK, I see in mcom.h MMAX_BUFFER is set to 65536 and MAX_MBUFFER is set to 65536 in msched-common.h

Our node names are 8 characters long and this job would be requesting 172 nodes specifically so that would be 1376 characters.
--
Steven DuChene

From stevenx.a.duchene at intel.com Tue Nov 29 11:53:57 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 29 Nov 2011 10:53:57 -0800
Subject: [Mauiusers] maui segfaults trying to schedule a job (SOLVED)
References: <4ED4F87B.9090608@calculquebec.ca>
Message-ID:

Thanks for the suggestions.
I bumped up the following:

In mcom.h
Changed MMAX_BUFFER from 65536 to 131072

In msched-common.h
Changed MAX_MBUFFER from 65536 to 131072
Changed MMAX_BUFFER from 65536 to 131072
Changed MAX_MTASK from 4096 to 8192

In msched.h
Changed MAX_MREQ_PER_JOB from 4 to 8

Once I did that the job request would not cause maui to segfault.

BTW, isn't there some way that this could just kick out some sort of error into the logs instead of just silently causing a segfault? I spent quite a bit of time looking around in the logs and running maui through strace to see what might be wrong. If it would log something about these constants being too low that would be a real big help in tracking down what needs to be changed.

Also, if this were a run-time change that could be made in a configuration file instead of having to edit include files and recompile, that would be great.
--
Steven DuChene

From stevenx.a.duchene at intel.com Tue Nov 29 12:13:51 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 29 Nov 2011 11:13:51 -0800
Subject: [Mauiusers] maui segfaults trying to schedule a job
In-Reply-To: <4ED52D9B.10508@calculquebec.ca>
References: <4ED4F87B.9090608@calculquebec.ca> <4ED52D9B.10508@calculquebec.ca>
Message-ID:

Did you also increase MMAX_REQ_PER_JOB as well?
--
Steven DuChene

-----Original Message-----
From: Michel Béland [mailto:michel.beland at calculquebec.ca]
Sent: Tuesday, November 29, 2011 11:08 AM
To: DuChene, StevenX A
Cc: mauiusers at supercluster.org
Subject: Re: [Mauiusers] maui segfaults trying to schedule a job

Steven DuChene wrote:
> No, the systems in this cluster only have 4 cores per node.
>
> BTW, in case this has some bearing on the problem here are the things I have set in my maui.cfg file:
>
>
> The job the user is submitting requests the following:
>
> #PBS -l nodes=39:SeaMicro:diskgrp1+39:SeaMicro:diskgrp2+39:SeaMicro:diskgrp3+39:SeaMicro:diskgrp4+16:SeaMicro:diskgrp6
>

That's it, you have five node requests with properties and the maximum is four. In your later message you wrote that you changed MAX_MREQ_PER_JOB from 4 to 8. This is what solved the problem.

On our cluster, I bumped MAX_MREQ_PER_JOB to 64, but still I think that anybody with malicious intentions could write a dummy job that would kill Maui all the time.

--
Michel Béland, analyste en calcul scientifique
michel.beland at calculquebec.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone : 514 343-6111 poste 3892
télécopieur : 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.org)

From stevenx.a.duchene at intel.com Wed Nov 30 11:17:14 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Wed, 30 Nov 2011 10:17:14 -0800
Subject: [Mauiusers] maui segfaults trying to schedule a job (SOLVED)
In-Reply-To: <4ED53183.60601@calculquebec.ca>
References: <4ED4F87B.9090608@calculquebec.ca> <4ED53183.60601@calculquebec.ca>
Message-ID:

I will see if I can run maui through a debugger to see where in the code this is dying. My guess is that the place where it dies depends on which of these various compiled-in limits has been exceeded. I might need some assistance trying to figure out how to exceed particular limits since I am not completely sure what each of the limits you suggested actually does.
--
Steven DuChene

-----Original Message-----
From: Michel Béland [mailto:michel.beland at calculquebec.ca]
Sent: Tuesday, November 29, 2011 11:25 AM
To: DuChene, StevenX A
Cc: mauiusers at supercluster.org
Subject: Re: [Mauiusers] maui segfaults trying to schedule a job (SOLVED)

Steven DuChene wrote:
> Thanks for the suggestions.
> I bumped up the following:
>
> In mcom.h
> Changed MMAX_BUFFER from 65536 to 131072
>
> In msched-common.h
> Changed MAX_MBUFFER from 65536 to 131072
> Changed MMAX_BUFFER from 65536 to 131072
> Changed MAX_MTASK from 4096 to 8192
>
> In msched.h
> Changed MAX_MREQ_PER_JOB from 4 to 8
>
> Once I did that the job request would not cause maui to segfault.
>
> BTW, isn't there someway that this could just kick out some sort of error into the logs instead of just silently causing a segfault? I spent quite a bit of time looking around in the logs and running maui through strace to see what might be wrong. If it would log something about these constants being too low that would be a real big help in tracking down what needs to be changed.
>
> Also if this was a run time change that could be made in a configuration file instead of having to edit include files and recompile would be great

I agree with you. This job is being done in Moab according to someone I talked to at Adaptive Computing's booth at SC2011, but I would not expect them to do it in Maui as it competes with Moab. It will have to be done by the community, that is you, me or someone else...

--
Michel Béland, analyste en calcul scientifique
michel.beland at calculquebec.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université
de Montréal
téléphone : 514 343-6111 poste 3892
télécopieur : 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.org)

From laurent.facq at math.u-bordeaux1.fr Mon Nov 28 02:54:55 2011
From: laurent.facq at math.u-bordeaux1.fr (Laurent Facq)
Date: Mon, 28 Nov 2011 10:54:55 +0100
Subject: [Mauiusers] Need to restart maui when memory requirement was too high but fixed
Message-ID: <4ED35A6F.3030706@math.u-bordeaux1.fr>

Hi!

I am using open_pbs 3.0.0 + maui 3.3.

When users submit a job requiring too much memory, torque queues the job but maui is unable to execute it. OK.

In checkjob I have:

job is deferred. Reason: NoResources (cannot create reservation for job 'xxxxx' (intital reservation attempt)

OK.

Rejection Reasons: [Class : 43][CPU : 57][State : 11]

===> Here I'm surprised to see that the job is rejected for a "CPU" problem. Shouldn't it be "Memory"?

When I use qalter to fix (clear) the memory constraint, then "releasehold -a JOBID" + "runjob -c JOBID", the job doesn't start and gets deferred again :(

I see with checkjob that maui is not aware that the memory constraints were removed. The line:

Dedicated Resources Per Task: PROCS: 1 MEM: xxxxM SWAP: xxG

is unchanged, where I expect a change in the MEM/SWAP values.

If I restart maui, the job starts.

Any idea?

Thank you.

L.
--
Laurent FACQ - +(33)5 4000 6956
Institut de Mathématiques de Bordeaux - Mathrice GDS 2754

From pwzhou at dicp.ac.cn Mon Nov 28 05:59:37 2011
From: pwzhou at dicp.ac.cn (Panwang Zhou)
Date: Mon, 28 Nov 2011 20:59:37 +0800
Subject: [Mauiusers] MAXNODE dose not work
Message-ID: <2011112820593734498470@dicp.ac.cn>

Dear all:

I am now using torque 3.0.2 and maui 3.3.1 on our two clusters. In one cluster, all the nodes have the same number of CPU cores (8); I used MAXPROC to limit the processors each user can use, and it works fine. In the other cluster, some nodes have 8 CPU cores and some have 16, so I used MAXNODE to limit the nodes each user can use. However, it does not work, and all users can submit jobs without any limitation.

Following is my maui.cfg:

# maui.cfg 3.3.1

SERVERHOST ln001.dicp.ac.cn
# primary admin must be first in list
ADMIN1 root

# Resource Manager Definition

RMCFG[LN001.DICP.AC.CN] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:30

SERVERPORT 42559
SERVERMODE NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT 1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

FSPOLICY DEDICATEDPS
FSDEPTH 4
FSINTERVAL 72:00:00
FSDECAY 0.75
FSWEIGHT 1

# User Fairshare
USERCFG[user1] FSTARGET=20 MAXNODE=8
USERCFG[user2] FSTARGET=7 MAXNODE=2
USERCFG[user3] FSTARGET=70 MAXNODE=35
USERCFG[user4] FSTARGET=3 MAXNODE=1

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

SYSCFG PLIST=lenovo:dawning
NODECFG[ln001] PARTITION=lenovo
NODECFG[ln001] PARTITION=lenovo
NODECFG[ln002] PARTITION=lenovo
NODECFG[ln003] PARTITION=lenovo
NODECFG[ln004] PARTITION=lenovo
NODECFG[ln005] PARTITION=lenovo
NODECFG[ln006] PARTITION=lenovo
NODECFG[ln007] PARTITION=lenovo
NODECFG[ln008] PARTITION=lenovo
NODECFG[ln009] PARTITION=lenovo
NODECFG[ln010] PARTITION=lenovo
NODECFG[ln011] PARTITION=lenovo
NODECFG[ln012] PARTITION=lenovo
NODECFG[ln013] PARTITION=lenovo
NODECFG[ln014] PARTITION=lenovo
NODECFG[ln015] PARTITION=lenovo
NODECFG[ln016] PARTITION=lenovo
NODECFG[ln017] PARTITION=lenovo
NODECFG[ln018] PARTITION=lenovo
NODECFG[ln019] PARTITION=lenovo
NODECFG[ln020] PARTITION=lenovo
NODECFG[dn001] PARTITION=dawning
NODECFG[dn002] PARTITION=dawning
NODECFG[dn003] PARTITION=dawning
NODECFG[dn004] PARTITION=dawning
NODECFG[dn005] PARTITION=dawning
NODECFG[dn006] PARTITION=dawning
NODECFG[dn007] PARTITION=dawning
NODECFG[dn008] PARTITION=dawning
NODECFG[dn009] PARTITION=dawning
NODECFG[dn010] PARTITION=dawning
NODECFG[dn011] PARTITION=dawning
NODECFG[dn012] PARTITION=dawning
NODECFG[dn013] PARTITION=dawning
NODECFG[dn014] PARTITION=dawning
NODECFG[dn015] PARTITION=dawning
NODECFG[dn016] PARTITION=dawning
NODECFG[dn017] PARTITION=dawning
NODECFG[dn018] PARTITION=dawning
NODECFG[dn019] PARTITION=dawning
NODECFG[dn020] PARTITION=dawning

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY MINRESOURCE

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

ENABLEMUITINODEJOBS TURE

==============================================
Panwang Zhou
State Key Laboratory of Molecular Reaction Dynamics
Dalian Institute of Chemical Physics
Chinese Academy of Sciences.
Tel: 0411-84379195
Fax: 0411-84675584
===============================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20111128/810213b8/attachment-0001.html