[Mauiusers] maui with gold problems...

Dave Jackson jacksond at clusterresources.com
Fri Dec 16 13:13:00 MST 2005


Jen,

  Can you run Maui under gdb?  (See section 14.1.4 of the online docs.)

  When the failure occurs, please issue 'where' and send us the output.
We will also attempt to reproduce this locally.
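
A minimal sketch of capturing that 'where' output non-interactively, in
case it is easier than an interactive gdb session. The helper names
(build_gdb_command, capture_backtrace) are hypothetical, and the
MAUIDEBUG=yes setting is an assumption that should be checked against
section 14.1.4 of the docs; the gdb options used (-batch, -ex, --args)
are standard gdb command-line options.

```python
import os
import subprocess


def build_gdb_command(binary="maui"):
    """Return a gdb batch invocation that runs the binary, then prints the stack."""
    return [
        "gdb", "-batch",
        "-ex", "run",      # start the scheduler under the debugger
        "-ex", "where",    # once it stops (e.g. on a crash), print the backtrace
        "--args", binary,
    ]


def capture_backtrace(binary="maui"):
    """Run the binary under gdb and return everything gdb printed."""
    env = dict(os.environ)
    # Assumption: MAUIDEBUG=yes keeps Maui in the foreground so gdb can
    # follow it; verify this against the debugger section of the Maui docs.
    env["MAUIDEBUG"] = "yes"
    result = subprocess.run(
        build_gdb_command(binary),
        env=env,
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr
```

Calling capture_backtrace() with the path to the maui binary and mailing
the returned text would be roughly equivalent to typing 'run' and then
'where' by hand.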

Dave

On Fri, 2005-12-16 at 14:51 -0500, Aquarijen wrote:
> Hi All,
> 
> I am not sure if this is a gold question or a maui question - so I am
> posting to both - I hope that is ok...
> Sorry for so many questions lately!  So, I made sure that no users on
> the test cluster have usernames beginning with a number.  I have gold
> running and I have accounts, projects, machines and users set up with
> 100000000 deposited to each gold account.
> If I configure maui to use gold as its AM, maui pretty much instantly
> dies.  I am using maui 3.2.6p13 and gold version 2.0.0.4.  I cleared
> out the checkpoint file.  I shut everything down and cleared the
> queue.  I then started gold, then maui, then pbs_server and then the
> pbs_moms.  Maui dies.  I've tried this in different orders, too.  Maui
> dies if I have the AMCFG line included.
> 
> Here is my simple maui.cfg:
> 
> # maui.cfg 3.2
> SERVERHOST              b05l02
> ADMIN1                root tippensjl
> RMCFG[base]  TYPE=PBS
> JOBAGGREGATIONTIME      00:00:10
> RMPOLLINTERVAL  00:00:30
> DOWNNODEDELAYTIME       72:00:00
> SERVERPORT            42559
> SERVERMODE            NORMAL
> LOGFILE               maui.log
> LOGFILEMAXSIZE        100000000
> LOGLEVEL              9
> QUEUETIMEWEIGHT[0]       10
> FSPOLICY              DEDICATEDPS
> FSDEPTH               7
> FSINTERVAL            24:00:00
> FSWEIGHT              1
> FSDECAY               0.80
> BACKFILLPOLICY  ON
> BACKFILLTYPE    BESTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
> NODEACCESSPOLICY        SHARED
> JOBMAXSTARTTIME         2:00:00
> JOBMAXOVERRUN           0:30:00
> AMCFG[bank] TYPE=GOLD HOST=b05l02 PORT=7112 SOCKETPROTOCOL=HTTP
> WIRE-PROTOCOL=XML CHARGEPOLICY=DEBITALLWC JOBFAILUREACTION=NONE
> FLUSHINTERVAL=12:00:00 TIMEOUT=15
> 
> And here is my maui-private.cfg:
> CLIENTCFG[AM:bank] CSKEY=sss CSALGO=HMAC
> 
> And here is the last little bit of my maui.log.  I have loglevel turned up to 9.
> 
> 12/16 14:32:42 MUserAdd(UName,UP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MCPRestore(USER,tippensjl,Optr)
> 12/16 14:32:42 INFO:     no checkpoint entry for object 'USER         
>        tippensjl '
> 12/16 14:32:42 INFO:     user tippensjl added
> 12/16 14:32:42 INFO:     PBS attribute 'job_state'  value: 'Q'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'queue'  value: 'workq'  (r: NULL)
> 12/16 14:32:42 MReqSetAttr(44,RQ,ReqClass,Value,1,2)
> 12/16 14:32:42 INFO:     job flags for job 44: 0
> 12/16 14:32:42 MJobSetAttr(44,GAttr,Value,1,5)
> 12/16 14:32:42 MUMAGetBM(JFeature,PREEMPTEE,3)
> 12/16 14:32:42 INFO:     attribute 'PREEMPTEE' cleared for job 44
> 12/16 14:32:42 MJobGetPAL(44,RPAL,PAL,NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'server'  value: 'b05l02'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Checkpoint'  value: 'u'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'ctime'  value: '1134761206'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Error_Path'  value:
> 'b05l02:/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111/jen-b5.e44'
>  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Hold_Types'  value: 'n'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Join_Path'  value: 'n'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Keep_Files'  value: 'n'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Mail_Points'  value: 'ae'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Mail_Users'  value:
> 'tippensjl at ornl.gov'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'mtime'  value: '1134761206'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Output_Path'  value:
> 'b05l02:/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111/jen-b5.o44'
>  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Priority'  value: '0'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'qtime'  value: '1134761206'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Rerunable'  value: 'True'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Resource_List'  value:
> '10000:00:00'  (r: cput)
> 12/16 14:32:42 INFO:     PBS attribute 'Resource_List'  value: '1'  (r: ncpus)
> 12/16 14:32:42 INFO:     PBS attribute 'Resource_List'  value:
> '30:ppn=2'  (r: neednodes)
> 12/16 14:32:42 __MPBSGetTaskList(44,30:ppn=2,NULL,0)
> 12/16 14:32:42 MReqSetAttr(44,RQ,ReqNodeFeature,Value,1,2)
> 12/16 14:32:42 INFO:     0 host task(s) located for job
> 12/16 14:32:42 INFO:     PBS attribute 'Resource_List'  value: '30'  (r: nodect)
> 12/16 14:32:42 INFO:     PBS attribute 'Resource_List'  value: '30:ppn=2'  (r: nodes)
> 12/16 14:32:42 INFO:     processing node request line '30:ppn=2'
> 12/16 14:32:42 __MPBSGetTaskList(44,30:ppn=2,NULL,0)
> 12/16 14:32:42 MReqSetAttr(44,RQ,ReqNodeFeature,Value,1,2)
> 12/16 14:32:42 INFO:     0 host task(s) located for job
> 12/16 14:32:42 INFO:     PBS attribute 'Resource_List'  value:
> '10000:00:00'  (r: walltime)
> 12/16 14:32:42 INFO:     PBS attribute 'Shell_Path_List'  value:
> '/bin/bash'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'substate'  value: '10'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'Variable_List'  value:
> 'PBS_O_HOME=/home/2vt,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=tippensjl,PBS_O_PATH=/opt/intel/cce/9.0/bin:/opt/intel/fce/9.0/bin:/usr/kerberos/bin:/opt/mpich-ch_p4-icc-1.2.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/env-switcher/bin:/opt/kernel_picker/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/home/2vt/bin,PBS_O_MAIL=/var/spool/mail/tippensjl,PBS_O_SHELL=/bin/bash,PBS_O_HOST=b05l02,PBS_O_WORKDIR=/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111,MODULE_VERSION_STACK=3.1.6,MANPATH=/opt/intel/cce/9.0/man:/opt/intel/fce/9.0/man:/opt/mpich-ch_p4-icc-1.2.7/man:/opt/modules/default/man:/usr/share/man:/usr/man:/usr/local/share/man:/usr/local/man:/usr/X11R6/man:/opt/pbs/man:/opt/env-switcher/man:/opt/kernel_picker/man:/opt/pvm3/man,HOSTNAME=b05l02,PVM_RSH=ssh,_MODULESBEGINENV_=/home/2vt/.modulesbeginenv,SHELL=/bin/bash,TERM=xterm,HISTSIZE=1000,TMPDIR=/home/2vt/.tmpdir,MODULE_SHELL=sh,OLDPWD=/home/2vt,MODULE_OSCAR_USER=tippensjl,USER=tippensjl,LD_LIBRARY_PATH=/opt/intel/mkl72/lib/em64t:/opt/intel/cce/9.0/lib:/opt/intel/fce/9.0/lib,LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:,ENV=/home/2vt/.bashrc,OSCAR_HOME=/opt/oscar,PVM_ROOT=/opt/pvm3,PVM_ARCH=LINUX,MODULE_VERSION=3.1.6,MAIL=/var/spool/mail/tippensjl,PATH=/opt/intel/cce/9.0/bin:/opt/intel/fce/9.0/bin:/usr/kerberos/bin:/opt/mpich-ch_p4-icc-1.2.7/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/env-switcher/bin:/opt/kernel_picker/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/home/2vt/bin,INPUTRC=/etc/inputrc,PWD=/home/2vt/jenstests/schulth/Science/dms/GaN/Cube2x2x2_1Mn_NCC/sic_11111,_LMFILES_=/opt/modules/oscar-modulefiles/default-manpath/1.0.1:/opt/modules/oscar-modulefiles/torque/1.2.0p5:/opt/env-switcher/share/env-switcher/mpi/mpich-ch_p4-icc-1.2.7:/opt/modules/oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/kernel_picker/1.4.1.3:/opt/modules/oscar-modulefiles/pvm/3.4.5+4:/opt/modules/modulefiles/oscar-modules/1.0.5:/opt/modules/modulefiles/iforte/9.0:/opt/modules/modulefiles/icce/9.0:/opt/modules/modulefiles/mkl-em64t/7.2,LANG=en_US.UTF-8,MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-modulefiles:/opt/modules/version:/opt/modules/$MODULE_VERSION/modulefiles:/opt/modules/modulefiles:,LOADEDMODULES=default-manpath/1.0.1:torque/1.2.0p5:mpi/mpich-ch_p4-icc-1.2.7:switcher/1.0.13:kernel_picker/1.4.1.3:pvm/3.4.5+4:oscar-modules/1.0.5:iforte/9.0:icce/9.0:mkl-em64t/7.2,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=1,HOME=/home/2vt,LOGNAME=tippensjl,MODULESHOME=/opt/modules/3.1.6,LESSOPEN=|/usr/bin/lesspipe.sh
> %s,G_BROKEN_FILENAMES=1,_=/opt/pbs/bin/qsub,PBS_O_QUEUE=workq'  (r:
> NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'euser'  value: 'tippensjl'  (r: NULL)
> 12/16 14:32:42 MUserAdd(UName,UP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 INFO:     PBS attribute 'egroup'  value: 'tippensjl'  (r: NULL)
> 12/16 14:32:42 MGroupAdd(GName,GP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MCPRestore(GROUP,tippensjl,Optr)
> 12/16 14:32:42 INFO:     no checkpoint entry for object 'GROUP        
>        tippensjl '
> 12/16 14:32:42 INFO:     group tippensjl added
> 12/16 14:32:42 INFO:     PBS attribute 'queue_rank'  value: '41'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'queue_type'  value: 'E'  (r: NULL)
> 12/16 14:32:42 INFO:     PBS attribute 'etime'  value: '1134761206'  (r: NULL)
> 12/16 14:32:42 MJobSetCreds(44,tippensjl,tippensjl,)
> 12/16 14:32:42 MUserAdd(UName,UP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MGroupAdd(GName,GP)
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MUGetHash(tippensjl)
> 12/16 14:32:42 INFO:     hash 'tippensjl' --> 550228005
> 12/16 14:32:42 MJobGetAccount(44,A)
> 12/16 14:32:42 MAMAccountGetDefault(tippensjl,AName,RIndex)
> 12/16 14:32:42 MSSSDoCommand(allocation-manager,NULL,OBuf,ODE,SC,EMsg)
> 12/16 14:32:42 MSysEMSubmit(EM,scheduler,comcom,scheduler,allocation-manager;)
> 12/16 14:32:42 INFO:     EM disabled
> 12/16 14:32:42 MSUConnect(S,TRUE,EMsg)
> 12/16 14:32:42 INFO:     trying to connect to 192.168.79.231 (Port: 7112)
> 12/16 14:32:42 INFO:     successful connect to TCP server (sd: 10)
> 12/16 14:32:42 MSUSendData(S,15000000,FALSE,FALSE)
> 12/16 14:32:42 MSecGetChecksum(Buf,185,Checksum,HMAC64,CSKey)
> 12/16 14:32:42 MSecHMACGetDigest(sss,3,<Body actor="root"><Request
> action="Query" actor="root"><Object>User</Object><Where
> name="Special">False</Where><Get name="Name"></Get><Get
> name="DefaultProject"></Get></Request></Body>,185,CSString,20,DigestString,TRUE,TRUE)
> 12/16 14:32:42 __MSecSHA1Init(context)
> 12/16 14:32:42 __MSecSHA1Transform(context)
> 
> And that's it - it just dies.  I have the feeling that this is
> something fairly easy that I didn't set up correctly...  Just can't
> seem to find what it is - I'm pretty new at this...  Oh, yeah, I am
> using torque 2.0.0p2 if that makes a difference.
> 
> Thank you for any help you can give - I'm pulling my hair out. :-O :)
> 
> -Jen
> 
> Jennifer Tippens
> Unix Admin, ORNL Institutional Cluster
> Oak Ridge National Lab
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
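
The quoted log stops inside MSecHMACGetDigest, right after the SHA-1 init
and transform calls, so the last thing Maui was doing was signing the
request it was about to send to Gold. A minimal sketch of that step,
assuming Maui computes a standard HMAC-SHA1 (RFC 2104) over the request
body using the CSKEY from maui-private.cfg: the body, the key 'sss', the
lengths 185 and 3, and the 20-byte digest size are taken from the log
line; the function name hmac_sha1_digest and the choice of ASCII encoding
are assumptions, and how the digest is encoded on the wire is not shown.

```python
import hashlib
import hmac

# Request body exactly as it appears in the MSecHMACGetDigest log line
# (the line breaks in the quoted mail stand for single spaces).
BODY = (
    '<Body actor="root"><Request action="Query" actor="root">'
    '<Object>User</Object><Where name="Special">False</Where>'
    '<Get name="Name"></Get><Get name="DefaultProject"></Get>'
    '</Request></Body>'
)

# CSKEY value from the quoted maui-private.cfg.
KEY = "sss"


def hmac_sha1_digest(key, body):
    """Return the raw HMAC-SHA1 digest of body under key (assumed ASCII)."""
    return hmac.new(key.encode("ascii"), body.encode("ascii"), hashlib.sha1).digest()


digest = hmac_sha1_digest(KEY, BODY)
```

Here len(BODY) is 185 and len(KEY) is 3, which agrees with the arguments
in the log, and the digest is 20 bytes, which agrees with the CSString
size. That suggests the inputs to the signing step were well formed, so
the gdb backtrace is still the best way to see where inside the digest
code Maui actually fails.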


