From Adrian.Sevcenco at cern.ch Fri Dec 2 03:35:06 2011
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Fri, 2 Dec 2011 12:35:06 +0200
Subject: [Mauiusers] admin access to maui server not working
Message-ID: <4ED8A9DA.90902@cern.ch>

Hi! I have a small annoying problem :
i have a server (for glite people - cream ce) which is NOT the torque and maui server. the problem is that for commands to maui i have this :
[root at grid03 init.d]# showq
ERROR: lost connection to server
ERROR: cannot request service (status)

connection is ok :
[root at grid03 init.d]# telnet grid01.spacescience.ro 40559
Trying 85.120.46.15...
Connected to grid01.spacescience.ro (85.120.46.15).
Escape character is '^]'.
^]
telnet> quit
Connection closed.

in maui config i have:
ADMIN1 root
ADMIN3 edginfo rgma edguser
ADMINHOSTS grid01.spacescience.ro grid03.spacescience.ro

Can anyone advise me how to debug this?
Thank you!
Adrian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3110 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20111202/c256f950/attachment-0001.bin

From m.a.oliveira at coimbra.lip.pt Fri Dec 2 03:39:37 2011
From: m.a.oliveira at coimbra.lip.pt (Miguel Oliveira)
Date: Fri, 2 Dec 2011 10:39:37 +0000
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <4ED8A9DA.90902@cern.ch>
References: <4ED8A9DA.90902@cern.ch>
Message-ID:

Hi,

On Dec 2, 2011, at 10:35 , Adrian Sevcenco wrote:

> Hi! I have a small annoying problem :
> i have a server (for glite people - cream ce) which is NOT the torque and maui server. the problem is that for commands to maui i have this :
> [root at grid03 init.d]# showq
> ERROR: lost connection to server
> ERROR: cannot request service (status)
>
> connection is ok :
> [root at grid03 init.d]# telnet grid01.spacescience.ro 40559
> Trying 85.120.46.15...
> Connected to grid01.spacescience.ro (85.120.46.15).
> Escape character is '^]'.
> ^]
> telnet> quit
> Connection closed.
>
> in maui config i have:
> ADMIN1 root
> ADMIN3 edginfo rgma edguser
> ADMINHOSTS grid01.spacescience.ro grid03.spacescience.ro

If this is the only thing you have in your maui.cfg then you are missing the most important parameter:

SERVERHOST

Cheers,

MAO

>
> Can anyone advise me how to debug this?
> Thank you!
> Adrian
>
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers

From Adrian.Sevcenco at cern.ch Fri Dec 2 03:43:56 2011
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Fri, 2 Dec 2011 12:43:56 +0200
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To:
References: <4ED8A9DA.90902@cern.ch>
Message-ID: <4ED8ABEC.7000908@cern.ch>

On 12/02/11 12:39, Miguel Oliveira wrote:
> Hi,
>
> On Dec 2, 2011, at 10:35 , Adrian Sevcenco wrote:
>
>> Hi! I have a small annoying problem : i have a server (for glite
>> people - cream ce) which is NOT the torque and maui server. the
>> problem is that for commands to maui i have this : [root at grid03
>> init.d]# showq ERROR: lost connection to server ERROR: cannot
>> request service (status)
>>
>> connection is ok : [root at grid03 init.d]# telnet
>> grid01.spacescience.ro 40559 Trying 85.120.46.15... Connected to
>> grid01.spacescience.ro (85.120.46.15). Escape character is '^]'.
>> ^] telnet> quit Connection closed.
>>
>> in maui config i have: ADMIN1 root ADMIN3
>> edginfo rgma edguser ADMINHOSTS
>> grid01.spacescience.ro grid03.spacescience.ro
>
> If this is the only thing you have in your maui.cfg then you are
> missing the most important parameter:
>
> SERVERHOST

of course i have it ..
SERVERHOST grid01.spacescience.ro
SERVERPORT 40559

but how is this related?

Thanks,
Adrian

From basv at sara.nl Fri Dec 2 04:20:39 2011
From: basv at sara.nl (Bas van der Vlies)
Date: Fri, 2 Dec 2011 12:20:39 +0100
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <4ED8A9DA.90902@cern.ch>
References: <4ED8A9DA.90902@cern.ch>
Message-ID:

Adrian,

Did you use the same key for both maui installs, see configure:
 * --with-key=0444

we use the maui commands on all compute nodes, but we have to make sure that they use the same key.

regards

On 2 dec. 2011, at 11:35, Adrian Sevcenco wrote:

> Hi! I have a small annoying problem :
> i have a server (for glite people - cream ce) which is NOT the torque
> and maui server. the problem is that for commands to maui i have this :
> [root at grid03 init.d]# showq
> ERROR: lost connection to server
> ERROR: cannot request service (status)
>
> connection is ok :
> [root at grid03 init.d]# telnet grid01.spacescience.ro 40559
> Trying 85.120.46.15...
> Connected to grid01.spacescience.ro (85.120.46.15).
> Escape character is '^]'.
> ^]
> telnet> quit
> Connection closed.
>
> in maui config i have:
> ADMIN1 root
> ADMIN3 edginfo rgma edguser
> ADMINHOSTS grid01.spacescience.ro
> grid03.spacescience.ro
>
> Can anyone advise me how to debug this?
> Thank you!
> Adrian

--
Bas van der Vlies
basv at sara.nl

From brianm at usc.edu Fri Dec 2 04:44:51 2011
From: brianm at usc.edu (Brian Mendenhall)
Date: Fri, 02 Dec 2011 03:44:51 -0800
Subject: [Mauiusers] admin access to maui server not working
Message-ID:

I think this is a very poorly documented issue that happens all the time. A special "key" is hard-coded and compiled into the binaries, and if you do not specify the same key when you compile maui on each of your hosts, it will create a random one for you as it assumes this is a stand-alone install.

Another way to solve this problem is to compile on one host, then copy the binaries to all of your client nodes, assuming they are all the same architecture.

Hope this helps,

Brian Mendenhall

Bas van der Vlies wrote:

Adrian,

Did you use the same key for both maui installs, see configure:
 * --with-key=0444

we use the maui commands on all compute nodes, but we have to make sure that they use the same key.

regards

On 2 dec. 2011, at 11:35, Adrian Sevcenco wrote:

> Hi! I have an small annoying problem :
> i have an server (for glite people - cream ce) which is NOT the torque
> and maui server. the problem is that for commands to maui i have this :
> [root at grid03 init.d]# showq
> ERROR: lost connection to server
> ERROR: cannot request service (status)
>
> connection is ok :
> [root at grid03 init.d]# telnet grid01.spacescience.ro 40559
> Trying 85.120.46.15...
> Connected to grid01.spacescience.ro (85.120.46.15).
> Escape character is '^]'.
> ^]
> telnet> quit
> Connection closed.
>
> in maui config i have:
> ADMIN1 root
> ADMIN3 edginfo rgma edguser
> ADMINHOSTS grid01.spacescience.ro
> grid03.spacescience.ro
>
> Can anyone advice me how can i debug this?
> Thank you!
> Adrian

--
Bas van der Vlies
basv at sara.nl

From Adrian.Sevcenco at cern.ch Fri Dec 2 05:01:37 2011
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Fri, 2 Dec 2011 14:01:37 +0200
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To:
References: <4ED8A9DA.90902@cern.ch>
Message-ID: <4ED8BE21.8040105@cern.ch>

On 12/02/11 13:20, Bas van der Vlies wrote:
> Adrian,
>
> Did you use the same key for both maui installs, see configure:
> * --with-key=0444

I have no idea! these are the rpms from gLite middleware

on server i have :
maui-3.2.6p21-snap.1224706197.2.slc4
maui-client-3.2.6p21-snap.1224706197.2.slc4
maui-server-3.2.6p21-snap.1224706197.2.slc4

on client i have :
maui-client-3.2.6p21-snap.1234905291.5.el5.x86_64
maui-3.2.6p21-snap.1234905291.5.el5.x86_64

how can i check the key? and why would such a thing be hardcoded?

Thanks!
Adrian

>
> we use the maui commands on all compute nodes, but we have to make sure that they use the same key.
>
> regards
>
> On 2 dec. 2011, at 11:35, Adrian Sevcenco wrote:
>
>> Hi! I have a small annoying problem :
>> i have a server (for glite people - cream ce) which is NOT the torque
>> and maui server. the problem is that for commands to maui i have this :
>> [root at grid03 init.d]# showq
>> ERROR: lost connection to server
>> ERROR: cannot request service (status)
>>
>> connection is ok :
>> [root at grid03 init.d]# telnet grid01.spacescience.ro 40559
>> Trying 85.120.46.15...
>> Connected to grid01.spacescience.ro (85.120.46.15).
>> Escape character is '^]'.
>> ^]
>> telnet> quit
>> Connection closed.
>>
>> in maui config i have:
>> ADMIN1 root
>> ADMIN3 edginfo rgma edguser
>> ADMINHOSTS grid01.spacescience.ro
>> grid03.spacescience.ro
>>
>> Can anyone advise me how to debug this?
>> Thank you!
>> Adrian
>
> --
> Bas van der Vlies
> basv at sara.nl

--
----------------------------------------------
Adrian Sevcenco |
Institute of Space Sciences - ISS, Romania |
adrian.sevcenco at {cern.ch,spacescience.ro} |
----------------------------------------------

From arnaubria at pic.es Fri Dec 2 05:04:34 2011
From: arnaubria at pic.es (Arnau Bria)
Date: Fri, 2 Dec 2011 13:04:34 +0100
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <4ED8A9DA.90902@cern.ch>
References: <4ED8A9DA.90902@cern.ch>
Message-ID: <20111202130434.6799f035@amarrosa.pic.es>

On Fri, 2 Dec 2011 12:35:06 +0200
Adrian Sevcenco wrote:

> Hi! I have a small annoying problem :
Hi,

[...]

> Can anyone advise me how to debug this?

on the server:
strings /usr/sbin/maui | grep '^[0-9][0-9][0-9][0-9][0-9][0-9]$'

on client:
save key on key.txt
showq --keyfile=key.txt

> Thank you!
> Adrian

HTH,
Arnau

From Adrian.Sevcenco at cern.ch Fri Dec 2 05:12:36 2011
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Fri, 2 Dec 2011 14:12:36 +0200
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <20111202130434.6799f035@amarrosa.pic.es>
References: <4ED8A9DA.90902@cern.ch> <20111202130434.6799f035@amarrosa.pic.es>
Message-ID: <4ED8C0B4.7020907@cern.ch>

On 12/02/11 14:04, Arnau Bria wrote:
> On Fri, 2 Dec 2011 12:35:06 +0200
> Adrian Sevcenco wrote:
>
>> Hi! I have a small annoying problem :
> Hi,
>
> [...]
>
>> Can anyone advise me how to debug this?
>
>
> on the server:
> strings /usr/sbin/maui | grep '^[0-9][0-9][0-9][0-9][0-9][0-9]$'

ok .. i found :
[root at grid01 var]# strings /usr/sbin/maui | grep '^[0-9][0-9][0-9][0-9][0-9][0-9]$'
333333

> on client:
> save key on key.txt
> showq --keyfile=key.txt

[root at grid03 ~]# showq --keyfile=key.txt --host=grid01.spacescience.ro --port=40559
ERROR: lost connection to server
ERROR: cannot request service (status)

so, still no joy ..

Thanks!
Adrian

From arnaubria at pic.es Fri Dec 2 05:16:47 2011
From: arnaubria at pic.es (Arnau Bria)
Date: Fri, 2 Dec 2011 13:16:47 +0100
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <4ED8C0B4.7020907@cern.ch>
References: <4ED8A9DA.90902@cern.ch> <20111202130434.6799f035@amarrosa.pic.es> <4ED8C0B4.7020907@cern.ch>
Message-ID: <20111202131647.173730e6@amarrosa.pic.es>

On Fri, 2 Dec 2011 14:12:36 +0200
Adrian Sevcenco wrote:

Hi Adrian,

> ok .. i found :
> [root at grid01 var]# strings /usr/sbin/maui | grep
> '^[0-9][0-9][0-9][0-9][0-9][0-9]$'
> 333333

this?
# strings /usr/sbin/maui | grep '^[0-9][0-9][0-9][0-9][0-9]$'
gives some output?
(one [0-9] less) I remember having some problems with this strings command when running the glite maui version.

> > on client:
> > save key on key.txt
> > showq --keyfile=key.txt
> [root at grid03 ~]# showq --keyfile=key.txt
> --host=grid01.spacescience.ro --port=40559
> ERROR: lost connection to server
> ERROR: cannot request service (status)

what does the server log say?

> so, still no joy ..
> Thanks!
> Adrian

Cheers,
Arnau

From arnaubria at pic.es Fri Dec 2 05:18:24 2011
From: arnaubria at pic.es (Arnau Bria)
Date: Fri, 2 Dec 2011 13:18:24 +0100
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <4ED8C0B4.7020907@cern.ch>
References: <4ED8A9DA.90902@cern.ch> <20111202130434.6799f035@amarrosa.pic.es> <4ED8C0B4.7020907@cern.ch>
Message-ID: <20111202131824.1146c3d9@amarrosa.pic.es>

Hi,

> [root at grid03 ~]# showq --keyfile=key.txt

could you try as edginfo user?
any firewall between both hosts?
is maui always replying to showq? (we have some delays when maui is scheduling)

Cheers,
Arnau

From Adrian.Sevcenco at cern.ch Fri Dec 2 05:26:37 2011
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Fri, 2 Dec 2011 14:26:37 +0200
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <20111202131647.173730e6@amarrosa.pic.es>
References: <4ED8A9DA.90902@cern.ch> <20111202130434.6799f035@amarrosa.pic.es> <4ED8C0B4.7020907@cern.ch> <20111202131647.173730e6@amarrosa.pic.es>
Message-ID: <4ED8C3FD.6040506@cern.ch>

On 12/02/11 14:16, Arnau Bria wrote:
> On Fri, 2 Dec 2011 14:12:36 +0200
> Adrian Sevcenco wrote:
>
> Hi Adrian,
>
>> ok .. i found :
>> [root at grid01 var]# strings /usr/sbin/maui | grep
>> '^[0-9][0-9][0-9][0-9][0-9][0-9]$'
>> 333333
>
> this?
> # strings /usr/sbin/maui | grep '^[0-9][0-9][0-9][0-9][0-9]$'
> gives some output? (one [0-9] less) I remember having some problems
> with this strings command when running the glite maui version.

i tried to take out a 3 .. but no joy ..
>
>>> on client:
>>> save key on key.txt
>>> showq --keyfile=key.txt
>> [root at grid03 ~]# showq --keyfile=key.txt
>> --host=grid01.spacescience.ro --port=40559
>> ERROR: lost connection to server
>> ERROR: cannot request service (status)
> what does the server log say?

12/02 14:22:08 ALERT: checksum does not match (e44b82ee72a17768:98322eb87f8331cf) request 'TS=1322828527 AUTH=root DT=CMD=showq AUTH=root ARG=0 ALL 0

between these 2 hosts there is no firewall.
as edginfo the same problems appear

thanks!!
Adrian

From arnaubria at pic.es Fri Dec 2 07:04:17 2011
From: arnaubria at pic.es (Arnau Bria)
Date: Fri, 2 Dec 2011 15:04:17 +0100
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <4ED8C3FD.6040506@cern.ch>
References: <4ED8A9DA.90902@cern.ch> <20111202130434.6799f035@amarrosa.pic.es> <4ED8C0B4.7020907@cern.ch> <20111202131647.173730e6@amarrosa.pic.es> <4ED8C3FD.6040506@cern.ch>
Message-ID: <20111202150417.7de173a3@amarrosa.pic.es>

Sorry, sent to Adrian only...

Try this key:

26756

# rpm -qa|grep maui-server
maui-server-3.2.6p21-snap.1224706197.2.slc4
# strings /usr/sbin/maui | grep '^[0-9][0-9][0-9][0-9][0-9]$'
26756
#

arnau at amarrosa ~ $ cat key.txt
26756
arnau at amarrosa ~ $ showq --host=wms01.pic.es --port=40559 --keyfile=key.txt
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME


0 Active Jobs 0 of 0 Processors Active (0.00%)

IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME


Total Jobs: 0 Active Jobs: 0 Idle Jobs: 0 Blocked Jobs: 0

HTH,
Arnau

From shaomin.hu at gmail.com Fri Dec 2 07:57:04 2011
From: shaomin.hu at gmail.com (Shaomin Hu)
Date: Fri, 2 Dec 2011 09:57:04 -0500
Subject: [Mauiusers] Maui not run jobs larger than nodes=256:ppn=16
Message-ID:

We are running Maui 3.3.1. We have 648 nodes defined in our system and each node has 16 cores. However, we could not get jobs that request -lnodes=400:ppn=16 to run. "checkjob" shows the following,

Holds: Defer
Messages: exceeds available partition procs
PE: 6400.00 StartPriority: 2478
cannot select job 1051 for partition DEFAULT (job hold active)

We don't have the partitions defined in Maui. Here are the major configurations,

[root at carter-adm hu8]# qstat -a 1051

carter-adm.rcac.purdue.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1051.carter-adm.     mluisier workq    P6400            --     400   640 --     02:00 Q --

[root at carter-adm hu8]# checkjob 1051

checking job 1051

State: Idle
Creds: user:mluisier group:ece class:workq qos:DEFAULT
WallTime: 00:00:00 of 2:00:00
SubmitTime: Mon Nov 21 09:07:36
(Time Queued Total: 11:00:40:20 Eligible: 00:00:00)

Total Tasks: 6400

Req[0] TaskCount: 6400 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [carter]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
Holds: Defer
Messages: exceeds available partition procs
PE: 6400.00 StartPriority: 2483
cannot select job 1051 for partition DEFAULT (job hold active)

[root at carter-adm hu8]#

[root at carter-adm maui]# cat maui.cfg
# maui.cfg 3.3.1

SERVERHOST carter-adm.rcac.purdue.edu
# primary admin must be first in list
ADMIN1 root

# Resource Manager Definition

RMCFG[CARTER-ADM.RCAC.PURDUE.EDU] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:30

SERVERPORT 42559
SERVERMODE NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT 1
#QUEUETIME 0

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

#BACKFILLPOLICY FIRSTFIT
#RESERVATIONPOLICY CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY MINRESOURCE

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

###### following added by Shaomin ######
CREDWEIGHT 1
CLASSWEIGHT 1
CLASSCFG[batch] PRIORITY=10
CLASSCFG[test] PRIORITY=1000 MAXPROC=32 MAXJOB=3
#CLASSCFG[test] PRIORITY=1000 MAXPROC=32 MAXJOB=3 HOSTLIST=carter-a638,carter-a639

SRCFG[test1] ACCESS=DEDICATED
SRCFG[test1] PERIOD=INFINITY
SRCFG[test1] CLASSLIST=test
SRCFG[test1] HOSTLIST=carter-a638,carter-a639

#QOSWEIGHT 1
#QOSCFG[batch] PRIORITY=10
#QOSCFG[test] PRIORITY=1000
[root at carter-adm maui]#

[root at carter-adm maui]# qstat -Qf workq
Queue: workq
    queue_type = Execution
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    resources_max.neednodes = carter
    resources_max.walltime = 168:00:00
    resources_default.neednodes = carter
    resources_default.nodes = 1
    resources_default.walltime = 04:00:00
    mtime = 1322577475
    enabled = True
    started = True

[root at carter-adm maui]#

[root at carter-adm maui]# pbsnodes carter-a000
carter-a000
    state = free
    np = 16
    properties = carter
    ntype = cluster
    status = rectime=1322837521,varattr=,jobs=,state=free,netload=680204799,gres=,loadave=0.07,ncpus=16,physmem=32841344kb,availmem=48525956kb,totmem=49618552kb,idletime=82118,nusers=0,nsessions=0,uname=Linux carter-a000.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

[root at carter-adm maui]#

Nodes definition examples,

carter-a000 np=16 carter
.....
carter-a647 np=16 carter

We are able to run jobs sized less than -lnodes=256:ppn=16.

Thanks for all help.

Shaomin

From gus at ldeo.columbia.edu Fri Dec 2 10:10:28 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Fri, 2 Dec 2011 12:10:28 -0500
Subject: [Mauiusers] MAXNODE dose not work
In-Reply-To: <2011112820593734498470@dicp.ac.cn>
References: <2011112820593734498470@dicp.ac.cn>
Message-ID: <25B45DD4-4A7C-4440-BC4B-C3DAEDB085B6@ldeo.columbia.edu>

It may not solve the problem, but there is a typo in your maui.cfg file:

ENABLEMUITINODEJOBS TURE

It should be (the parameter name is also misspelled, MUITI instead of MULTI):

ENABLEMULTINODEJOBS TRUE

I hope this helps,
Gus Correa

On Nov 28, 2011, at 7:59 AM, Panwang Zhou wrote:

> Dear all:
>
> Now I am using the torque 3.0.2 and maui 3.3.1 in our two clusters, in one cluster, all the nodes have the same CPU cores (8), I used the MAXPROC to limit the processors per user can use, it works fine. In another cluster, some nodes have 8 CPU cores and some nodes have 16 CPU cores, so I used the MAXNODE to limit the nodes per user can use, however, it does not work, and all the user can submit jobs without any limitation.
Following is my maui.cfg: > > # maui.cfg 3.3.1 > > SERVERHOST ln001.dicp.ac.cn > # primary admin must be first in list > ADMIN1 root > > # Resource Manager Definition > > RMCFG[LN001.DICP.AC.CN] TYPE=PBS > > # Allocation Manager Definition > > AMCFG[bank] TYPE=NONE > > # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html > # use the 'schedctl -l' command to display current configuration > > RMPOLLINTERVAL 00:00:30 > > SERVERPORT 42559 > SERVERMODE NORMAL > > # Admin: http://supercluster.org/mauidocs/a.esecurity.html > > > LOGFILE maui.log > LOGFILEMAXSIZE 10000000 > LOGLEVEL 3 > > # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html > > QUEUETIMEWEIGHT 1 > > # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html > > FSPOLICY DEDICATEDPS > FSDEPTH 4 > FSINTERVAL 72:00:00 > FSDECAY 0.75 > FSWEIGHT 1 > > # User Fairshare > USERCFG[user1] FSTARGET=20 MAXNODE=8 > USERCFG[user2] FSTARGET=7 MAXNODE=2 > USERCFG[user3] FSTARGET=70 MAXNODE=35 > USERCFG[user4] FSTARGET=3 MAXNODE=1 > > # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html > > # NONE SPECIFIED > SYSCFG PLIST=lenovo:dawning > NODECFG[ln001] PARTITION=lenovo > NODECFG[ln001] PARTITION=lenovo > NODECFG[ln002] PARTITION=lenovo > NODECFG[ln003] PARTITION=lenovo > NODECFG[ln004] PARTITION=lenovo > NODECFG[ln005] PARTITION=lenovo > NODECFG[ln006] PARTITION=lenovo > NODECFG[ln007] PARTITION=lenovo > NODECFG[ln008] PARTITION=lenovo > NODECFG[ln009] PARTITION=lenovo > NODECFG[ln010] PARTITION=lenovo > NODECFG[ln011] PARTITION=lenovo > NODECFG[ln012] PARTITION=lenovo > NODECFG[ln013] PARTITION=lenovo > NODECFG[ln014] PARTITION=lenovo > NODECFG[ln015] PARTITION=lenovo > NODECFG[ln016] PARTITION=lenovo > NODECFG[ln017] PARTITION=lenovo > NODECFG[ln018] PARTITION=lenovo > NODECFG[ln019] PARTITION=lenovo > NODECFG[ln020] PARTITION=lenovo > NODECFG[dn001] PARTITION=dawning > NODECFG[dn002] PARTITION=dawning > NODECFG[dn003] PARTITION=dawning > 
NODECFG[dn004] PARTITION=dawning > NODECFG[dn005] PARTITION=dawning > NODECFG[dn006] PARTITION=dawning > NODECFG[dn007] PARTITION=dawning > NODECFG[dn008] PARTITION=dawning > NODECFG[dn009] PARTITION=dawning > NODECFG[dn010] PARTITION=dawning > NODECFG[dn011] PARTITION=dawning > NODECFG[dn012] PARTITION=dawning > NODECFG[dn013] PARTITION=dawning > NODECFG[dn014] PARTITION=dawning > NODECFG[dn015] PARTITION=dawning > NODECFG[dn016] PARTITION=dawning > NODECFG[dn017] PARTITION=dawning > NODECFG[dn018] PARTITION=dawning > NODECFG[dn019] PARTITION=dawning > NODECFG[dn020] PARTITION=dawning > > # Backfill: http://supercluster.org/mauidocs/8.2backfill.html > > BACKFILLPOLICY FIRSTFIT > RESERVATIONPOLICY CURRENTHIGHEST > > # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html > > NODEALLOCATIONPOLICY MINRESOURCE > > # QOS: http://supercluster.org/mauidocs/7.3qos.html > > # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB > # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE > > # Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html > > # SRSTARTTIME[test] 8:00:00 > # SRENDTIME[test] 17:00:00 > # SRDAYS[test] MON TUE WED THU FRI > # SRTASKCOUNT[test] 20 > # SRMAXTIME[test] 0:30:00 > > # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html > > # USERCFG[DEFAULT] FSTARGET=25.0 > # USERCFG[john] PRIORITY=100 FSTARGET=10.0- > # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi > # CLASSCFG[batch] FLAGS=PREEMPTEE > # CLASSCFG[interactive] FLAGS=PREEMPTOR > ENABLEMUITINODEJOBS TURE > > ============================================== > Panwang Zhou > State Key Laboratory of Molecular Reaction Dynamics > Dalian Institute of Chemical Physics > Chinese Academy of Sciences. 
> Tel: 0411-84379195 Fax: 0411-84675584
> ===============================================

From Adrian.Sevcenco at cern.ch Fri Dec 2 14:22:04 2011
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Fri, 2 Dec 2011 23:22:04 +0200
Subject: [Mauiusers] admin access to maui server not working
In-Reply-To: <20111202150417.7de173a3@amarrosa.pic.es>
References: <4ED8A9DA.90902@cern.ch> <20111202130434.6799f035@amarrosa.pic.es> <4ED8C0B4.7020907@cern.ch> <20111202131647.173730e6@amarrosa.pic.es> <4ED8C3FD.6040506@cern.ch> <20111202150417.7de173a3@amarrosa.pic.es>
Message-ID: <4ED9417C.5090107@cern.ch>

On 12/02/11 16:04, Arnau Bria wrote:
> Sorry, sent to Adrian only...
>
>
> Try this key:
>
> 26756

GREAT!!! IT WORKED !!!

now just for the rant : wtf with this key???? why is such a thing in existence? WHY IS IT HARDCODED????

Thank you!
Adrian

> # rpm -qa|grep maui-server
> maui-server-3.2.6p21-snap.1224706197.2.slc4
> # strings /usr/sbin/maui | grep '^[0-9][0-9][0-9][0-9][0-9]$'
> 26756
> #
>
> arnau at amarrosa ~ $ cat key.txt
> 26756
> arnau at amarrosa ~ $ showq --host=wms01.pic.es --port=40559
> --keyfile=key.txt ACTIVE JOBS--------------------
> JOBNAME USERNAME STATE PROC REMAINING
> STARTTIME
>
>
> 0 Active Jobs 0 of 0 Processors Active (0.00%)
>
> IDLE JOBS----------------------
> JOBNAME USERNAME STATE PROC WCLIMIT
> QUEUETIME
>
>
> 0 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME USERNAME STATE PROC WCLIMIT
> QUEUETIME
>
>
> Total Jobs: 0 Active Jobs: 0 Idle Jobs: 0 Blocked Jobs: 0
>
>
> HTH,
> Arnau

From Adrian.Sevcenco at cern.ch Fri Dec 2 15:25:46 2011
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Sat, 3 Dec 2011 00:25:46 +0200
Subject: [Mauiusers] CLIENTCFG usage
In-Reply-To: <4ED9417C.5090107@cern.ch>
References: <4ED8A9DA.90902@cern.ch> <20111202130434.6799f035@amarrosa.pic.es> <4ED8C0B4.7020907@cern.ch> <20111202131647.173730e6@amarrosa.pic.es> <4ED8C3FD.6040506@cern.ch> <20111202150417.7de173a3@amarrosa.pic.es> <4ED9417C.5090107@cern.ch>
Message-ID: <4ED9506A.2080208@cern.ch>

So .. my saga continues .. now i want the access to be automatic.

The description from :
http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php#clientcfg
http://www.adaptivecomputing.com/resources/docs/maui/a.esecurity.php
http://www.adaptivecomputing.com/resources/docs/maui/13.2rmconfiguration.php

is totally unclear!

So ..
1. is CLIENTCFG set only on the client?
2. with which arguments?

from Appendix E: Security i deduced that it has to be of the form :
[root at grid03 maui]# cat maui-private.cfg
CLIENTCFG[RM:base] CSKEY=26756

but it's not working! there is no explanation of what arguments CLIENTCFG takes!

i have this configuration (maui.cfg is identical on server and client):
SERVERNAME grid01
SERVERHOST grid01.spacescience.ro
ADMIN1 root
ADMIN3 edginfo rgma edguser
ADMINHOSTS grid01.spacescience.ro grid03.spacescience.ro
RMCFG[base] TYPE=PBS
SERVERPORT 40559
SERVERMODE NORMAL

can someone enlighten me what arguments CLIENTCFG should take?

with the explicit command it is ok:
showq --keyfile=key.txt

Thanks!
Adrian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3110 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20111203/bee74436/attachment.bin

From deepak.soni at tatatechnologies.com Sun Dec 4 21:06:02 2011
From: deepak.soni at tatatechnologies.com (deepak.soni at tatatechnologies.com)
Date: Mon, 5 Dec 2011 09:36:02 +0530
Subject: [Mauiusers] FW: regarding dynamic license fetching
Message-ID: <2C27F99B8AD2A74F9B6AA6C94B854ED4A80825@TTPNEMSG04.ttglobal.tatatechnologies.com>

FYI

________________________________
From: Soni Deepak [Application Engineer]
Sent: Monday, December 05, 2011 9:34 AM
To: 'developers at supercluster.org'
Subject: FW: regarding dynamic license fetching
Importance: High

FYI

________________________________
From: Soni Deepak [Application Engineer]
Sent: Monday, December 05, 2011 9:29 AM
To: 'mauiusers-owner at supercluster.org'
Subject: regarding dynamic license fetching
Importance: High

Dear Sir/Madam,

I have been using Maui for the last year, but I do not know how to set up dynamic license fetching as a resource for jobs. My requirement is simply that each job first checks license availability for the software and then runs. This is possible in Altair PBS Pro, so kindly help me set it up in Maui. I just want to integrate a license-checking script into Maui: if the script returns a number, the job runs; otherwise the job stays in the queue.

I need your support. Thanking you.

Regards
Deepak Soni

************************************************************************
Email Disclaimer:
Information contained and transmitted by this e-mail (including any attachments) is confidential, proprietary and legally privileged data of Tata Technologies that is intended for use only by the addressee. If you are not the intended recipient, you are notified that any review, use, dissemination, distribution, copying or printing of this e-mail is strictly prohibited. You are requested to delete this e-mail or any copies immediately and notify the sender by reply email. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. Tata Technologies does not accept any liability for virus infected email or errors or omissions or consequences which may arise as a result of this e-mail transmission. To know more about Tata Technologies please visit http://www.tatatechnologies.com
************************************************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20111205/12f0f990/attachment-0001.html

From jpeltier at sfu.ca Mon Dec 5 10:27:29 2011
From: jpeltier at sfu.ca (James A. Peltier)
Date: Mon, 5 Dec 2011 09:27:29 -0800 (PST)
Subject: [Mauiusers] FW: regarding dynamic license fetching
In-Reply-To: <2C27F99B8AD2A74F9B6AA6C94B854ED4A80825@TTPNEMSG04.ttglobal.tatatechnologies.com>
Message-ID: <1526960086.353172.1323106049770.JavaMail.root@jaguar10.sfu.ca>

Maui does not contain support for FlexLM licensing. You need Moab for that. However, Maui does have support for keeping track of up to three license products via the GRES parameter, but it does not reliably check whether a license is actually available.

NODECFG[GLOBAL] GRES=matlab:80

Here we limit the cluster to 80 Matlab licenses running concurrently so that we don't eat up all the licenses. However, let's say we only have 120 licenses and 90 of them are checked out. Maui would still schedule up to 80 jobs, because fewer than 80 are scheduled, but only 30 of those jobs would continue to run, because subsequent license checkouts would fail.
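The arithmetic in that scenario can be sketched as follows. This only illustrates the numbers quoted above and makes no claim about how Maui works internally; the function and parameter names (`jobs_that_keep_running`, `total_licenses`, `checked_out_elsewhere`, `gres_cap`) are hypothetical and chosen for the example.

```python
def jobs_that_keep_running(total_licenses, checked_out_elsewhere, gres_cap):
    """Return (jobs the scheduler may start, jobs that actually get a license).

    Hypothetical model: the scheduler enforces a static GRES cap and knows
    nothing about licenses checked out outside the cluster.
    """
    # The scheduler only counts its own jobs against the static cap.
    started = gres_cap
    # The license server can hand out only what is really left.
    really_free = max(total_licenses - checked_out_elsewhere, 0)
    licensed = min(started, really_free)
    return started, licensed


started, licensed = jobs_that_keep_running(
    total_licenses=120, checked_out_elsewhere=90, gres_cap=80
)
print(f"started={started} licensed={licensed} failed={started - licensed}")
```

With the figures from the message, 80 jobs may be started but only 30 obtain a license, which is why a static cap is safe only when the licenses are not shared with users outside the cluster.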
--
James A. Peltier
Manager, IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone : 778-782-6573
Fax : 778-782-3045
E-Mail : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
http://blogs.sfu.ca/people/jpeltier

I will do the best I can with the talent I have

From deepak.soni at tatatechnologies.com Mon Dec 5 20:41:07 2011
From: deepak.soni at tatatechnologies.com (deepak.soni at tatatechnologies.com)
Date: Tue, 6 Dec 2011 09:11:07 +0530
Subject: [Mauiusers] FW: regarding dynamic license fetching
In-Reply-To: <1526960086.353172.1323106049770.JavaMail.root@jaguar10.sfu.ca>
Message-ID: <2C27F99B8AD2A74F9B6AA6C94B854ED4A808EF@TTPNEMSG04.ttglobal.tatatechnologies.com>

Thank you very much for such significant information and as per your suggestion moab is supporting flexlm licensing, how can i get it and what is the price for that?

regards
Deepak

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20111206/40631788/attachment-0001.html

From grid-admin at mpib-berlin.mpg.de Tue Dec 6 04:49:08 2011
From: grid-admin at mpib-berlin.mpg.de (Grid Admin)
Date: Tue, 6 Dec 2011 11:49:08 +0000
Subject: [Mauiusers] FW: regarding dynamic license fetching
In-Reply-To: <1526960086.353172.1323106049770.JavaMail.root@jaguar10.sfu.ca>
References: <1526960086.353172.1323106049770.JavaMail.root@jaguar10.sfu.ca>
Message-ID: <4EDE012E.9060509@mpib-berlin.mpg.de>

On 05.12.2011 18:27, James A. Peltier wrote:
> Maui does not contain support for FlexLM licensing. You need Moab for that. However, Maui does have support for up to three license products to keep track of via the GRES parameter, but does not reliably check if there is actually a license available.
>
> NODECFG[GLOBAL] GRES=matlab:80
> Here we limit the cluster to 80 Matlab licenses running concurrently so that we don't eat up all the licenses. However, let's say we only have 120 licenses and 90 of them are checked out... Maui would schedule up to 80 jobs because there aren't 80 scheduled but only 30 of the jobs would continue to run because subsequent license check outs would fail.

So assuming there is fixed number of licenses for only the cluster available the GRES approach would work, correct?

Out of curiosity: Could we not just create a new queue and set max_running to the number of licenses? I fail to see any difference to generic resources. One would simply change the Matlab PBS markup to use "qsub -q matlab" instead of "qsub -l other=matlab", no?
From stevenx.a.duchene at intel.com Tue Dec 6 12:54:34 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 6 Dec 2011 11:54:34 -0800
Subject: [Mauiusers] maui segfault in gdb but not out of debugger
Message-ID:

So I am trying to narrow down where in the code the segfault occurs when some of the compiled-in max constant values are exceeded. I was hoping to see what the code in those areas looked like, but I seem to be running into an unusual problem with running maui inside gdb.

I have two versions of maui compiled. The first is a stock version that segfaults when a certain type of workload is submitted. The second has various compiled-in constants adjusted so that the segfault does not occur with the same job submission. I compiled maui with -g and -O0 so the symbols were there and there were no optimizations.

I loaded the stock, segfaulting version into gdb and set a breakpoint where some log messages are repeated for each task occurrence. I started the application with the following options:

maui -d --configfile=/usr/local/maui/maui.cfg --loglevel=10

Then, in another window, I submitted the problem job that causes the segfault and stepped through the loops to get to the point where the segfault occurs.

Once I had seen where in the code the segfault occurs inside gdb, I exited gdb, adjusted the various max constants one at a time, and recompiled maui. After each single adjustment I reloaded the new maui binary into gdb and repeated the process. I continued until I had the same max constant settings as the binary that works successfully outside of gdb.

I expected that at some point the maui binary with the right adjustments would NOT segfault inside gdb; however, this is not the case. No matter what adjustments I made, maui would still segfault at the same place when I ran it inside gdb and submitted the problem job.
This occurs at line 803 of src/moab/MQueue.c.

If I then take that same adjusted binary, with the successful settings for the max constants, and run it normally outside gdb, it does NOT segfault when that job is submitted.

So is there some special magic to get maui to run inside a gdb debugger environment so that it acts like it does outside of gdb?

I was researching on Google and I see posts where things segfault outside of gdb but not inside gdb. I was not able to find any references to the opposite condition, where something runs normally outside of gdb but segfaults when run inside the debugger.

Any thoughts or suggestions?
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20111206/3a97f17d/attachment.html

From steve at isc.tamu.edu Tue Dec 6 14:18:47 2011
From: steve at isc.tamu.edu (Steve Johnson)
Date: Tue, 06 Dec 2011 15:18:47 -0600
Subject: [Mauiusers] maui not resuming preempted jobs
Message-ID: <4EDE86B7.1010907@isc.tamu.edu>

I've observed the following problem:

1) preemptible job "A" from our "background" Torque queue is running on a node. This job has specified -l nodes=1:ppn=8, a full node.
2) job A is sent SIGSTOP and preempting job B from the "high" queue starts. It has also specified -l nodes=1:ppn=8.
3) job B is short lived and exits before the next maui iteration
4) on the next iteration job A is *not* sent a SIGCONT to resume
5) instead queued job C with -l nodes=1:ppn=8 from the background queue is started on the same node. Job B is gone and job A is still suspended.

The expected behavior, of course, is that job A would be resumed.

Here are some of the vitals from maui.cfg:

'\' added for readability

RMPOLLINTERVAL 00:00:45
NODEACCESSPOLICY SHARED
NODEAVAILABILITYPOLICY COMBINED
NODEALLOCATIONPOLICY PRIORITY
PREEMPTPOLICY SUSPEND
BACKFILLPOLICY BESTFIT
BACKFILLMETRIC PROCSECONDS
BFCHUNKDURATION 05:00
BFCHUNKSIZE 4
RESERVATIONPOLICY CURRENTHIGHEST
RESERVATIONDEPTH[0] 256
RESERVATIONQOSLIST[0] preemptor
RESERVATIONDEPTH[1] 384
RESERVATIONQOSLIST[1] background
DEFERTIME 0
DEFERCOUNT 1000
QOSCFG[background] QFLAGS=DEDICATED:PREEMPTEE
QOSCFG[preemptor] QFLAGS=DEDICATED:PREEMPTOR
CLASSCFG[background] QDEF=background PRIORITY=9 \
    MAXPROC=2240 MAXNODE=280
CLASSCFG[high] QDEF=preemptor PRIORITY=10000 \
    MAXPROC=1024,1440 MAXNODE=128,180

Notice that the default NODEACCESSPOLICY is SHARED, but the QFLAGS should override this with DEDICATED for the two QOSCFG's. We have other queues that omit QFLAGS=DEDICATED and pack np single-core jobs onto a node.

I can't say this is fully reproducible, but a short-lived job can trigger this behavior. The total number of jobs across all queues is approximately 600. Maui 3.3.1.

Any ideas? Any thoughts on further diagnosis? The log level is currently somewhat low, so I'm only seeing the MRMJob{Start,Suspend,Resume} actions.

// Steve

From govind.rhul at gmail.com Thu Dec 8 05:41:13 2011
From: govind.rhul at gmail.com (Govind B. Songara)
Date: Thu, 8 Dec 2011 12:41:13 +0000
Subject: [Mauiusers] max jobs per group per node
Message-ID:

Just re-posting this question to the maui list, sorry if you got it twice.

I am looking for an option to configure max jobs per group per node. Is it possible in maui?

Thanks
Govind
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20111208/e75dfd44/attachment.html

From WJEdsall at dow.com Thu Dec 8 07:43:30 2011
From: WJEdsall at dow.com (Edsall, William (WJ))
Date: Thu, 8 Dec 2011 09:43:30 -0500
Subject: [Mauiusers] queue priorities preventing utilization
Message-ID: <52CD990A674498429E6A7B4FCAE3F7D3081BBA@USMDLMDOWX025.dow.com>

Hello list,

We have an issue with two queues (gaussian, overflow) that both have a list of queued jobs. The gaussian queue, which has the higher priority, is preventing the overflow queue from running (jobs stay in Q status), although the resources that gaussian is requesting are available.

The primary queue, gaussian, is asking for a particular resource that the overflow queue is not asking for. It is still preventing overflow jobs from starting.

I can manually qrun one of the queued jobs, and it starts just fine.

Overflow in this case is a Preemptee, but gaussian is not a preemptor. I don't feel like preemption is causing this.

Thanks for the help!
Will

From WJEdsall at dow.com Thu Dec 8 09:56:44 2011
From: WJEdsall at dow.com (Edsall, William (WJ))
Date: Thu, 8 Dec 2011 11:56:44 -0500
Subject: [Mauiusers] queue priorities preventing utilization
References: <52CD990A674498429E6A7B4FCAE3F7D3081BBA@USMDLMDOWX025.dow.com>
Message-ID: <52CD990A674498429E6A7B4FCAE3F7D3081BC8@USMDLMDOWX025.dow.com>

We may have just answered our own question. The backfill policy was definitely causing the issue: we had it set to NONE. We had set it to NONE because of the order in which backfilled jobs were being picked up after they had been suspended and then resumed.

Will the BESTFIT policy attempt to start jobs from top to bottom of the queue? For instance, if a Suspended job becomes eligible again, should it have priority?
From WJEdsall at dow.com Wed Dec 14 13:57:52 2011
From: WJEdsall at dow.com (Edsall, William (WJ))
Date: Wed, 14 Dec 2011 15:57:52 -0500
Subject: [Mauiusers] queue priority including wait time?
Message-ID: <52CD990A674498429E6A7B4FCAE3F7D307E0A571@USMDLMDOWX025.dow.com>

Hello list!,

We have been using a Preemptee queue called overflow. The queue has a priority of 100, while the other queues default to a priority of 10000.

A user had queued up extra jobs in overflow, and these queued jobs seemed to block any new jobs from the priority queues. Could it have been that their priority went up because they were in the queue so long? How can we disable this for a specific queue? If this is a fairshare setting, we are using fairshare defaults.
Here is the checkjob info for a job that was blocking:

checking job 1132

State: Idle
Creds:  user:joe  group:users  class:overflow  qos:overflow
WallTime: 00:00:00 of 9:04:00:00
SubmitTime: Wed Dec 7 16:31:03
  (Time Queued  Total: 6:23:20:15  Eligible: 6:23:19:20)

Total Tasks: 12

Req[0]  TaskCount: 12  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [eth]

IWD: [NONE]  Executable: [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE
Attr: PREEMPTEE

PE: 12.00  StartPriority: 10038
job cannot run in partition DEFAULT (insufficient idle procs available: 0 < 12)

Thanks!
Will
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20111214/e26aa8fd/attachment.html

From tbaer at utk.edu Wed Dec 14 14:11:59 2011
From: tbaer at utk.edu (Troy Baer)
Date: Wed, 14 Dec 2011 16:11:59 -0500
Subject: [Mauiusers] queue priority including wait time?
In-Reply-To: <52CD990A674498429E6A7B4FCAE3F7D307E0A571@USMDLMDOWX025.dow.com>
References: <52CD990A674498429E6A7B4FCAE3F7D307E0A571@USMDLMDOWX025.dow.com>
Message-ID: <1323897119.2542.464.camel@browncoat.jics.utk.edu>

On Wed, 2011-12-14 at 15:57 -0500, Edsall, William (WJ) wrote:
> We have been using a Preemptee queue called overflow. The queue has a
> priority of 100 while the other queues default to 10000 priority.
>
> A user had queued up extra jobs in overflow, and it these queued jobs
> seemed to block any new priority queue jobs. Could it have been
> because they were in the queue so long, their priority went up? How
> can we disable this for a specific queue? If this is a fairshare
> setting we are using fairshare defaults.

What does "diagnose -p" say?
	--Troy
--
Troy Baer, HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

From WJEdsall at dow.com Wed Dec 14 14:16:33 2011
From: WJEdsall at dow.com (Edsall, William (WJ))
Date: Wed, 14 Dec 2011 16:16:33 -0500
Subject: [Mauiusers] queue priority including wait time?
In-Reply-To: <1323897119.2542.464.camel@browncoat.jics.utk.edu>
References: <52CD990A674498429E6A7B4FCAE3F7D307E0A571@USMDLMDOWX025.dow.com> <1323897119.2542.464.camel@browncoat.jics.utk.edu>
Message-ID: <52CD990A674498429E6A7B4FCAE3F7D307E0A5A3@USMDLMDOWX025.dow.com>

Unfortunately I had to clear them out to fit some jobs in, but I have a similar comparison:

]# diagnose -p | more
diagnosing job priority information (partition: ALL)

Job         PRIORITY*   Cred(  QOS:Class)   Serv(QTime)
  Weights   --------       1(    1:    1)      1(    1)

258693         10076     1.0(  0.0:100.0)   99.0(9975.)
258704         10076     1.0(  0.0:100.0)   99.0(9975.)
258705         10076     1.0(  0.0:100.0)   99.0(9975.)
258708         10076     1.0(  0.0:100.0)   99.0(9975.)
258714         10076     1.0(  0.0:100.0)   99.0(9975.)
258474          9949     1.0(  0.0:100.0)   99.0(9849.)
258463          9810     1.0(  0.0:100.0)   99.0(9709.)

The top jobs, with a priority above 10,000, are preempted jobs that originally had a priority of 100. They are currently suspended.

From tbaer at utk.edu Wed Dec 14 14:53:30 2011
From: tbaer at utk.edu (Troy Baer)
Date: Wed, 14 Dec 2011 16:53:30 -0500
Subject: [Mauiusers] queue priority including wait time?
In-Reply-To: <52CD990A674498429E6A7B4FCAE3F7D307E0A5A3@USMDLMDOWX025.dow.com>
References: <52CD990A674498429E6A7B4FCAE3F7D307E0A571@USMDLMDOWX025.dow.com> <1323897119.2542.464.camel@browncoat.jics.utk.edu> <52CD990A674498429E6A7B4FCAE3F7D307E0A5A3@USMDLMDOWX025.dow.com>
Message-ID: <1323899610.2542.480.camel@browncoat.jics.utk.edu>

On Wed, 2011-12-14 at 16:16 -0500, Edsall, William (WJ) wrote:
> Unfortunately I had to clear them out to fit some jobs in.. but I have
> a similar comparison:
>
> ]# diagnose -p | more
> diagnosing job priority information (partition: ALL)
>
> Job         PRIORITY*   Cred(  QOS:Class)   Serv(QTime)
>   Weights   --------       1(    1:    1)      1(    1)
>
> 258693         10076     1.0(  0.0:100.0)   99.0(9975.)
> 258704         10076     1.0(  0.0:100.0)   99.0(9975.)
> 258705         10076     1.0(  0.0:100.0)   99.0(9975.)
> 258708         10076     1.0(  0.0:100.0)   99.0(9975.)
> 258714         10076     1.0(  0.0:100.0)   99.0(9975.)
> 258474          9949     1.0(  0.0:100.0)   99.0(9849.)
> 258463          9810     1.0(  0.0:100.0)   99.0(9709.)
>
> These top jobs with 10,000+ are preempted, originally had 100
> priority. Currently they are suspended.

IIRC, the QUEUETIMEWEIGHT factor is multiplied by the number of minutes the job has been eligible to run, and the default value for it is 1.

There are a couple of different things you could do here:

* Set QUEUETIMEWEIGHT to 0, which has the potentially unfortunate side effect of not giving any priority to *any* jobs which have been sitting for a long time.

* Set CREDWEIGHT to something larger than 1 (e.g.
10 or 100) to reduce the effect of jobs' queue wait time relative to the priority factor from the class.

* Use a QOS to adjust QUEUETIMEWEIGHT for the jobs in question. (See http://www.adaptivecomputing.com/resources/docs/maui/5.1.2priorityfactors.php#queuetimesubcomponent for details.)

	--Troy
--
Troy Baer, HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

From steve at isc.tamu.edu Wed Dec 14 15:04:50 2011
From: steve at isc.tamu.edu (Steve Johnson)
Date: Wed, 14 Dec 2011 16:04:50 -0600
Subject: [Mauiusers] queue priority including wait time?
In-Reply-To: <1323899610.2542.480.camel@browncoat.jics.utk.edu>
References: <52CD990A674498429E6A7B4FCAE3F7D307E0A571@USMDLMDOWX025.dow.com> <1323897119.2542.464.camel@browncoat.jics.utk.edu> <52CD990A674498429E6A7B4FCAE3F7D307E0A5A3@USMDLMDOWX025.dow.com> <1323899610.2542.480.camel@browncoat.jics.utk.edu>
Message-ID: <4EE91D82.7000103@isc.tamu.edu>

You can also cap the queue time contribution with QUEUETIMECAP so that it won't exceed the priority of the preempting queue.

http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php

// Steve

From bunk at physik.hu-berlin.de Thu Dec 15 07:46:35 2011
From: bunk at physik.hu-berlin.de (Burkhard Bunk)
Date: Thu, 15 Dec 2011 15:46:35 +0100 (CET)
Subject: [Mauiusers] hostlist corruption with ENABLEMULTIREQJOBS TRUE
Message-ID:

Hi,

I tried to start a 32-core job on 7 quadcore and 2 dualcore machines:

qsub -l nodes=7:ppn=4+2:ppn=2 ...

With torque's FIFO scheduler (pbs_sched), the job starts as expected. With maui, I have

ENABLEMULTIREQJOBS TRUE

but the job gets deferred and will never start. The reason is revealed by an error message in maui.log:

12/14 13:24:31 ERROR: job '40' cannot be started: (rc: 15064 errmsg: 'Unknown node ' hostlist: 'dong2:ppn=6+dong3:ppn=4+dong4:ppn=6
+gdong1:ppn=4+gdong2:ppn=4+gdong3:ppn=4+gdong4:ppn=4')
12/14 13:24:31 ALERT: cannot start job 40 (RM 'base' failed in function 'jobstart')

The "hostlist" correctly lists the 7 quadcore hosts, but instead of adding the 2 dualcores, it overloads two of the quadcores ("ppn=6").

The bug is seen with torque-2.5.9 and maui-3.3 as well as maui-3.3.1.

Testing with "smaller" requests like

qsub -l nodes=4:ppn=4+2:ppn=2

does indeed work in the same configuration.
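The mismatch between the request and the logged hostlist can be checked with a small sketch like the one below. It is only an illustration: the helper names (`parse_spec`, `summarize_request`, `summarize_hostlist`) are hypothetical, and the hostlist string is copied from the log message above with the line-wrap whitespace removed.

```python
def parse_spec(spec):
    """Split 'a:ppn=P+b:ppn=Q' into [(a, P), (b, Q)]; ppn defaults to 1."""
    chunks = []
    for part in spec.split("+"):
        head, _, ppn = part.strip().partition(":ppn=")
        chunks.append((head, int(ppn) if ppn else 1))
    return chunks


def summarize_request(spec):
    """For a qsub node spec, each chunk is (node count, procs per node)."""
    chunks = [(int(count), ppn) for count, ppn in parse_spec(spec)]
    nodes = sum(count for count, _ in chunks)
    procs = sum(count * ppn for count, ppn in chunks)
    return nodes, procs


def summarize_hostlist(hostlist):
    """For a hostlist, each chunk is (host name, procs on that host)."""
    chunks = parse_spec(hostlist)
    return len(chunks), sum(ppn for _, ppn in chunks)


request = "7:ppn=4+2:ppn=2"
hostlist = ("dong2:ppn=6+dong3:ppn=4+dong4:ppn=6"
            "+gdong1:ppn=4+gdong2:ppn=4+gdong3:ppn=4+gdong4:ppn=4")

print("requested (nodes, procs):", summarize_request(request))
print("hostlist  (nodes, procs):", summarize_hostlist(hostlist))
```

Both totals come to 32 processors, but the request spans 9 nodes while the hostlist names only 7, so the processors of the two missing dual-core nodes appear to have been folded into two of the quad-core hosts.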
Maybe hostlists of a certain size/complexity are needed to trigger the buggy behavior?

Please let me know if you need more info for debugging the case.

Best regards,
Burkhard Bunk.
----------------------------------------------------------------------
bunk at physik.hu-berlin.de      Physics Institute, Humboldt University
fax:    ++49-30 2093 7628       Newtonstr. 15
phone:  ++49-30 2093 7980       12489 Berlin, Germany
----------------------------------------------------------------------

From listsarnau at gmail.com Tue Dec 20 09:08:27 2011
From: listsarnau at gmail.com (Arnau Bria)
Date: Tue, 20 Dec 2011 17:08:27 +0100
Subject: [Mauiusers] limiting resource usage with torque
Message-ID: <20111220170827.2dc3da6c@amarrosa.pic.es>

Hi all,

first of all let me apologise for cross-posting. I've already asked this question on the torque mailing list but I got no reply. As this is torque/MAUI related, I believe that asking it here is not off-topic.

We were testing how to limit and request resource usage with torque. The doc, and some docs I found on the net, said that defining resources_max at queue level is enough for limiting resource usage:

* page 62 of torque doc v 3.0.0
resource_max
Specifies the maximum resource limits for jobs submitted to the queue

So, we did something like:

resources_max.vmem=6gb

Also, after configuring 'size [fs=/home]' on all nodes, we added a default resource request (free disk space) at submitfilter level:

line="#PBS -l file=30gb -c n"

from man:

-l resource_list
Defines the resources that are required by the job and establishes a limit to the amount of resource that can be consumed.

Jobs were submitted with:

Resource_List.file = 30gb
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.pvmem = 6000mb

which seemed to work fine, but after some jobs started running, we noticed that nodes were not running all the jobs they were supposed to, although they were in free state. I.e., a node with 24gb of mem (PHYS+SWAP) using only 12gb of mem did not run more than 4 jobs when 8 was its limit. So, if it had free resources, why was it not running more jobs?

After some debugging we found the source. MAUI was reserving 6gb of mem for each job, so 4 jobs * 6gb of mem = 24gb. All the mem was reserved for those 4 jobs and the node was not selected for running more.

from checknode:

[...]
Configured Resources: PROCS: 8  MEM: 15G  SWAP: 23G  DISK: 122G
Utilized   Resources: SWAP: 5048M  DISK: 35G
Dedicated  Resources: PROCS: 4  SWAP: 23G  DISK: 30G
[...]

And we suppose that something similar is going to happen with the DISK resource if more jobs start (yep, we have some nodes with low disk space).

So, did we understand correctly the resource.max parameter and the -l qsub option? Why that maui resource reservation? Maybe this question should go to the maui list, but to avoid double-posting (yet): can we avoid the maui reservation of resources? How are other admins limiting VMEM usage per job? How can we request that some disk space be available?

Many thanks in advance, and especially to those who read this far ;-)

Cheers,
Arnau
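The reservation arithmetic described in this message can be sketched as follows, assuming the scheduler dedicates the full per-job memory request to each job regardless of what the job actually uses. The function name `max_jobs_on_node` and its parameters are hypothetical, and the figures are the rounded ones from the message (8 processors, 24 GB of PHYS+SWAP, 6 GB requested per job).

```python
def max_jobs_on_node(procs, mem_gb, per_job_mem_gb):
    """Jobs that fit when each one dedicates a processor and its full memory request."""
    by_procs = procs                      # one processor per job
    by_mem = mem_gb // per_job_mem_gb     # whole jobs that fit in dedicated memory
    return min(by_procs, by_mem)


# Figures from the message: the memory request, not the processor count,
# is the binding limit.
print(max_jobs_on_node(procs=8, mem_gb=24, per_job_mem_gb=6))
# A smaller per-job request would let all 8 processors be used.
print(max_jobs_on_node(procs=8, mem_gb=24, per_job_mem_gb=3))
```

Under that assumption the node fills up at 4 jobs even though only about 12 GB is in use, which matches the Dedicated Resources line in the checknode output.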