[Mauiusers] gold + maui: bankfailure
Stijn De Weirdt
Stijn.DeWeirdt at ugent.be
Wed Aug 13 10:15:26 MDT 2008
we are doing some testing with gold and maui.
one of the things we can't get to work is the 'job reservation at job
start time' policy. (charging when the job is finished works, so i don't
suspect anything is wrong with gold itself.)
when there is a bank failure, jobs start to run no matter what we try.
the maui admin guide states that there is a parameter,
DEFERJOBONFAILURE, that should deal with this (i.e. setting it to TRUE
should keep jobs in state Q). it is not clear whether this applies to any
bank failure or only to the case where the AM can't be reached, but in
both cases it doesn't seem to work ;) (logfile extract at the bottom).
what is even more bizarre: when setting this parameter, maui says
08/13 16:41:04 INFO: AMCFG set to DEFERJOBONFAILURE=TRUE
08/13 16:41:04 MUGetIndex(DEFERJOBONFAILURE,ValList,0)
08/13 16:41:04 WARNING: AM attribute 'DEFERJOBONFAILURE' not handled
i grepped the maui code for anything related and also found
BANKDEFERONJOBFAILURE (mind the subtle difference in naming), which has a
default value of FALSE. so i changed that default to TRUE and rebuilt
maui, but got the same result, so maybe it's something else.
hints are welcome.
from maui.log with loglevel 9:
08/13 17:58:46 ERROR: cannot connect to allocation-manager server
08/13 17:58:46 MSysRegEvent(RMFAILURE: cannot connect to
allocation-manager server head1.x.y.z:7112 (command: '<XML>')
08/13 17:58:46 MSysLaunchAction(ASList,1)
08/13 17:58:46 INFO: scheduler action 1 disabled
08/13 17:58:46 INFO: command response 'NULL'
08/13 17:58:46 ALERT: no job data available
08/13 17:58:46 MSUDisconnect(S)
08/13 17:58:46 ALERT: cannot extract status
08/13 17:58:46 ALERT: cannot reserve allocation for job
08/13 17:58:46 WARNING: cannot reserve allocation for job '121',
08/13 17:58:46 MRMJobStart(121,Msg,SC)
08/13 17:58:46 MPBSJobStart(121,torque,Msg,SC)
08/13 15:10:11 WARNING: request failed
08/13 15:10:11 ALERT: request failed with status code 740 (Project
account8 does not exist)
08/13 15:10:11 MSUDisconnect(S)
08/13 15:10:11 ERROR: cannot receive response from allocation-manager
08/13 15:10:11 MSysRegEvent(FAILURE: cannot receive response from
allocation-manager server head1.x.y.z:7112 (cmd: '<XML>')
08/13 15:10:11 MSysLaunchAction(ASList,1)
08/13 15:10:11 INFO: command response 'NULL'
08/13 15:10:11 ALERT: no job data available
08/13 15:10:11 ALERT: cannot extract status
08/13 15:10:11 ALERT: cannot reserve allocation for job
08/13 15:10:11 WARNING: cannot reserve allocation for job '107',
08/13 15:10:11 MRMJobStart(107,Msg,SC)
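to check a longer maui.log for the same pattern (a failed allocation reservation followed by a start of the same job), something like the rough python sketch below could help. it assumes the real log has one message per line and that the WARNING and MRMJobStart lines look exactly like the ones in the extract above; the function name and the sample lines are only for illustration and are not part of maui.

```python
import re

# patterns taken from the log extract above; the real log format may differ
RESERVE_FAILED = re.compile(r"cannot reserve allocation for job '([^']+)'")
JOB_START = re.compile(r"MRMJobStart\(([^,]+),")


def started_despite_failure(lines):
    """Return job ids that were started after their reservation failed."""
    failed = set()   # job ids with a failed allocation reservation
    started = []     # job ids started anyway, in order of appearance
    for line in lines:
        match = RESERVE_FAILED.search(line)
        if match:
            failed.add(match.group(1))
            continue
        match = JOB_START.search(line)
        if match and match.group(1) in failed and match.group(1) not in started:
            started.append(match.group(1))
    return started


# sample lines copied from the extract above (hypothetical test input)
SAMPLE = """\
08/13 17:58:46 ALERT: cannot reserve allocation for job
08/13 17:58:46 WARNING: cannot reserve allocation for job '121',
08/13 17:58:46 MRMJobStart(121,Msg,SC)
08/13 15:10:11 WARNING: cannot reserve allocation for job '107',
08/13 15:10:11 MRMJobStart(107,Msg,SC)
""".splitlines()

if __name__ == "__main__":
    print(started_despite_failure(SAMPLE))
```

for a real log, the lines could be read with open("maui.log") and passed to the same function. any job id it reports was started even though the reservation at job start time had failed.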