[Mauiusers] Torque-Maui preemption problem
Zvika Galant
zvika at Camero-Tech.com
Wed Sep 28 01:44:02 MDT 2005
Hi
I have an installation of Torque 1.2.0 and Maui 3.2.6p11.
Maui is configured with two parallel queues, for high and low priority,
as follows:
SERVERHOST creambo
ADMIN1 root
RMCFG[base] TYPE=PBS HOST=creambo EPORT=15004
SOCKETPROTOCOL=HTTP@RMNMHOST@ NMPORT=12321
CHARGEPOLICY=DEBITALLWC JOBFAILUREACTION=NONE TIMEOUT=15
RMPOLLINTERVAL 00:00:02
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
BACKFILLPOLICY BESTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
QOSWEIGHT 1
CREDWEIGHT 1
PREEMPTOIONPOLICY REQUEUE
QOSCFG[hi] PRIORITY=1000 XFTARGET=100 QFLAGS=PREEMPTOR
QOSCFG[low] PRIORITY=-100 QFLAGS=PREEMPTEE
CLASSCFG[long] QDEF=low
CLASSCFG[short] QDEF=hi
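As a sanity check on a configuration like the one above, the parameter names can be compared against the names one expects Maui to recognize. The following is only a minimal sketch: KNOWN_PARAMETERS is a hand-typed, incomplete list (an assumption, not taken from Maui itself), the helper names are hypothetical, and CONFIG is an excerpt of the configuration in this post. PREEMPTPOLICY is included on the assumption that it is the spelling the Maui documentation uses for the preemption policy parameter.

```python
import difflib
import re

# Assumption: hand-typed and incomplete; verify against the Maui documentation.
KNOWN_PARAMETERS = [
    "SERVERHOST", "ADMIN1", "RMCFG", "RMPOLLINTERVAL", "SERVERPORT",
    "SERVERMODE", "LOGFILE", "LOGFILEMAXSIZE", "LOGLEVEL", "QUEUETIMEWEIGHT",
    "BACKFILLPOLICY", "RESERVATIONPOLICY", "NODEALLOCATIONPOLICY",
    "QOSWEIGHT", "CREDWEIGHT", "PREEMPTPOLICY", "QOSCFG", "CLASSCFG",
]

# Excerpt of the configuration quoted in this post.
CONFIG = """\
SERVERHOST creambo
ADMIN1 root
RMCFG[base] TYPE=PBS HOST=creambo EPORT=15004
CHARGEPOLICY=DEBITALLWC JOBFAILUREACTION=NONE TIMEOUT=15
RMPOLLINTERVAL 00:00:02
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
BACKFILLPOLICY BESTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
QOSWEIGHT 1
CREDWEIGHT 1
PREEMPTOIONPOLICY REQUEUE
QOSCFG[hi] PRIORITY=1000 XFTARGET=100 QFLAGS=PREEMPTOR
QOSCFG[low] PRIORITY=-100 QFLAGS=PREEMPTEE
CLASSCFG[long] QDEF=low
CLASSCFG[short] QDEF=hi
"""


def parameter_names(config_text):
    """Return the leading parameter name of each configuration line."""
    names = []
    for line in config_text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # blank line or comment
        first = tokens[0]
        if "=" in first:
            continue  # wrapped continuation of the previous line (KEY=VALUE)
        # Drop an index suffix such as [base] or [hi].
        names.append(re.sub(r"\[.*\]$", "", first))
    return names


def unknown_parameters(config_text, known=KNOWN_PARAMETERS):
    """Map each unrecognized name to its closest known name, or None."""
    report = {}
    for name in parameter_names(config_text):
        if name not in known:
            close = difflib.get_close_matches(name, known, n=1)
            report[name] = close[0] if close else None
    return report


if __name__ == "__main__":
    for name, suggestion in unknown_parameters(CONFIG).items():
        print(f"unrecognized: {name} (closest known name: {suggestion})")
```

A name flagged here is only a candidate for a typo. Whether Maui rejects or silently ignores such a line should be confirmed in maui.log at startup.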
These 2 queues are configured in Qmgr as follows:
Max open servers: 4
Qmgr: list queue long
Queue long
queue_type = Execution
Priority = 100
total_jobs = 4
state_count = Transit:0 Queued:4 Held:0 Waiting:0 Running:0
Exiting:0
max_running = 16
enabled = True
Qmgr: list queue short
Queue short
queue_type = Execution
Priority = 1000
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0
Exiting:0
max_running = 16
resources_assigned.nodect = 0
enabled = True
When a high-priority job is submitted to a fully busy queue, a
low-priority job is preempted, but the preempted job fails to restart
once the resource is freed.
The same problem occurs with the Checkpoint and Suspend preemption
policies.
The following Maui log shows the failure:
09/28 09:00:28 INFO: 16 feasible tasks found for job 72710:0 in
partition DEFAULT (1 Needed)
09/28 09:00:28 INFO: tasks located for job 72710: 1 of 1 required
(1 feasible)
09/28 09:00:28 MJobStart(72710)
09/28 09:00:28 MJobDistributeTasks(72710,base,NodeList,TaskMap)
09/28 09:00:28 MAMAllocJReserve(72710,RIndex,ErrMsg)
09/28 09:00:28 MRMJobStart(72710,Msg,SC)
09/28 09:00:28 MPBSJobStart(72710,base,Msg,SC)
09/28 09:00:28 MPBSJobModify(72710,Resource_List,Resource,wild1)
09/28 09:00:28 ERROR: job '72710' cannot be started: (rc: 15044
errmsg: 'Resource temporarily unavailable' hostlist: 'wild1')
09/28 09:00:28 MPBSJobModify(72710,Resource_List,Resource,1)
09/28 09:00:28 ALERT: cannot start job 72710 (RM 'base' failed in
function 'jobstart')
09/28 09:00:28 WARNING: cannot start job '72710' through resource
manager
09/28 09:00:28 ALERT: job '72710' deferred after 2 failed start
attempts (API failure on last attempt)
09/28 09:00:28 MJobSetHold(72710,16,1:00:00,RMFailure,cannot start job -
RM failure, rc: 15044, msg: 'Resource temporarily unavailable')
09/28 09:00:28 ALERT: job '72710' cannot run (deferring job for 3600
seconds)
09/28 09:00:28 MSysRegEvent(JOBDEFER: defer hold placed on job '72710'.
reason: 'RMFailure',0,0,1)
09/28 09:00:28 MSysLaunchAction(ASList,1)
09/28 09:00:28 ERROR: cannot start job '72710' in partition DEFAULT
09/28 09:00:28 MJobPReserve(72710,DEFAULT,ResCount,ResCountRej)
09/28 09:00:28 MJobPReserve(72712,DEFAULT,ResCount,ResCountRej)
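For longer logs, the failed start attempts can be pulled out mechanically. The following is a minimal sketch, assuming the log has the same layout as the excerpt above, including mail-wrapped lines that must be rejoined before matching. The helper names unwrap and start_failures are hypothetical, and LOG holds only a few lines copied from this post.

```python
import re

# A few lines copied from the log in this post, including wrapped entries.
LOG = """\
09/28 09:00:28 MPBSJobModify(72710,Resource_List,Resource,wild1)
09/28 09:00:28 ERROR: job '72710' cannot be started: (rc: 15044
errmsg: 'Resource temporarily unavailable' hostlist: 'wild1')
09/28 09:00:28 MPBSJobModify(72710,Resource_List,Resource,1)
09/28 09:00:28 ALERT: job '72710' deferred after 2 failed start
attempts (API failure on last attempt)
"""

# Every real log entry starts with "MM/DD HH:MM:SS ".
TIMESTAMP = re.compile(r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")

START_FAILURE = re.compile(
    r"job '(?P<job>\d+)' cannot be started: "
    r"\(rc: (?P<rc>\d+)\s+errmsg: '(?P<errmsg>[^']*)'"
    r"\s+hostlist: '(?P<hosts>[^']*)'\)"
)


def unwrap(log_text):
    """Rejoin wrapped lines so that each entry sits on a single line."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)          # start of a new entry
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def start_failures(log_text):
    """Return (job id, rc, error message, host list) for each failed start."""
    failures = []
    for entry in unwrap(log_text):
        match = START_FAILURE.search(entry)
        if match:
            failures.append(
                (match["job"], int(match["rc"]), match["errmsg"], match["hosts"])
            )
    return failures


if __name__ == "__main__":
    for job, rc, errmsg, hosts in start_failures(LOG):
        print(f"job {job}: rc={rc} errmsg={errmsg!r} hosts={hosts}")
```

Grouping the results by return code and host makes it easier to see whether the failures are tied to a particular node.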
Resubmitting another job also failed, this time with rc=15041.
The behavior is consistently reproducible.
Has anybody encountered such a problem?