[Mauiusers] Torque-Maui preemption problem

Zvika Galant zvika at Camero-Tech.com
Wed Sep 28 01:44:02 MDT 2005


Hi

 

I have an installation of Torque 1.2.0 and Maui 3.2.6p11.

Maui is configured with two parallel queues, one for high-priority and one
for low-priority jobs, as follows:

 

SERVERHOST              creambo

ADMIN1                root

RMCFG[base] TYPE=PBS HOST=creambo EPORT=15004

SOCKETPROTOCOL=HTTP@RMNMHOST@ NMPORT=12321

CHARGEPOLICY=DEBITALLWC JOBFAILUREACTION=NONE TIMEOUT=15

RMPOLLINTERVAL       00:00:02

SERVERPORT            42559

SERVERMODE            NORMAL

LOGFILE               maui.log

LOGFILEMAXSIZE        10000000

LOGLEVEL              3

QUEUETIMEWEIGHT       1

BACKFILLPOLICY         BESTFIT

RESERVATIONPOLICY     CURRENTHIGHEST

NODEALLOCATIONPOLICY  MINRESOURCE

QOSWEIGHT 1

CREDWEIGHT 1

PREEMPTPOLICY REQUEUE

QOSCFG[hi]  PRIORITY=1000 XFTARGET=100 QFLAGS=PREEMPTOR

QOSCFG[low] PRIORITY=-100 QFLAGS=PREEMPTEE

CLASSCFG[long]    QDEF=low

CLASSCFG[short]   QDEF=hi
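
To make the intended mapping explicit (queue -> default QOS -> preemption role), here is a minimal Python sketch that reads the four QOSCFG/CLASSCFG lines above and reports which queue ends up as preemptor and which as preemptee. The helper names parse_indexed_params and class_roles are hypothetical and only for illustration; this is not how Maui itself parses its configuration.

```python
import re

# Lines copied from the configuration above.
CONFIG = """\
QOSCFG[hi]  PRIORITY=1000 XFTARGET=100 QFLAGS=PREEMPTOR
QOSCFG[low] PRIORITY=-100 QFLAGS=PREEMPTEE
CLASSCFG[long]    QDEF=low
CLASSCFG[short]   QDEF=hi
"""

# Matches "NAME[index] KEY=VALUE KEY=VALUE ..." style lines.
LINE_RE = re.compile(r"^(\w+)\[(\w+)\]\s+(.*)$")


def parse_indexed_params(text):
    """Return {param_name: {index: {key: value}}} for NAME[index] lines."""
    parsed = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and non-indexed parameters
        name, index, rest = match.groups()
        attrs = dict(pair.split("=", 1) for pair in rest.split())
        parsed.setdefault(name, {})[index] = attrs
    return parsed


def class_roles(text):
    """Map each class (queue) to the QFLAGS of its default QOS."""
    params = parse_indexed_params(text)
    roles = {}
    for cls, attrs in params.get("CLASSCFG", {}).items():
        qos = attrs.get("QDEF")
        roles[cls] = params.get("QOSCFG", {}).get(qos, {}).get("QFLAGS")
    return roles


if __name__ == "__main__":
    print(class_roles(CONFIG))
```

With the configuration as posted, jobs in the long queue are the preemptees and jobs in the short queue are the preemptors.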

 

 

These 2 queues are configured in Qmgr as follows:

 

Max open servers: 4

Qmgr: list queue long

Queue long

        queue_type = Execution

        Priority = 100

        total_jobs = 4

        state_count = Transit:0 Queued:4 Held:0 Waiting:0 Running:0 Exiting:0

        max_running = 16

        enabled = True

 

Qmgr: list queue short

Queue short

        queue_type = Execution

        Priority = 1000

        total_jobs = 0

        state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0

        max_running = 16

        resources_assigned.nodect = 0

        enabled = True

 

 

When a high-priority job is submitted while all resources are busy, a
low-priority job is preempted as expected, but the preempted job cannot
be restarted once the resource is freed.

The same problem also occurs with a preemption policy of Checkpoint or
Suspend.

The following Maui log shows the problem:

 

09/28 09:00:28 INFO:     16 feasible tasks found for job 72710:0 in partition DEFAULT (1 Needed)

09/28 09:00:28 INFO:     tasks located for job 72710:  1 of 1 required (1 feasible)

09/28 09:00:28 MJobStart(72710)

09/28 09:00:28 MJobDistributeTasks(72710,base,NodeList,TaskMap)

09/28 09:00:28 MAMAllocJReserve(72710,RIndex,ErrMsg)

09/28 09:00:28 MRMJobStart(72710,Msg,SC)

09/28 09:00:28 MPBSJobStart(72710,base,Msg,SC)

09/28 09:00:28 MPBSJobModify(72710,Resource_List,Resource,wild1)

09/28 09:00:28 ERROR:    job '72710' cannot be started: (rc: 15044 errmsg: 'Resource temporarily unavailable'  hostlist: 'wild1')

09/28 09:00:28 MPBSJobModify(72710,Resource_List,Resource,1)

09/28 09:00:28 ALERT:    cannot start job 72710 (RM 'base' failed in function 'jobstart')

09/28 09:00:28 WARNING:  cannot start job '72710' through resource manager

09/28 09:00:28 ALERT:    job '72710' deferred after 2 failed start attempts (API failure on last attempt)

09/28 09:00:28 MJobSetHold(72710,16,1:00:00,RMFailure,cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable')

09/28 09:00:28 ALERT:    job '72710' cannot run (deferring job for 3600 seconds)

09/28 09:00:28 MSysRegEvent(JOBDEFER:  defer hold placed on job '72710'. reason: 'RMFailure',0,0,1)

09/28 09:00:28 MSysLaunchAction(ASList,1)

09/28 09:00:28 ERROR:    cannot start job '72710' in partition DEFAULT

09/28 09:00:28 MJobPReserve(72710,DEFAULT,ResCount,ResCountRej)

09/28 09:00:28 MJobPReserve(72712,DEFAULT,ResCount,ResCountRej)
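
For anyone wanting to check their own maui.log for the same pattern, here is a rough Python sketch that pulls the failed job start (job id, rc, error message, host list) and the resulting defer time out of log text. The regular expressions are written only against the two entries quoted from the log above, so the assumption that other Maui versions or log levels use the same wording is untested; the function name summarize is hypothetical.

```python
import re

# Two entries copied from the log above, with wrapped lines joined.
LOG = """\
09/28 09:00:28 ERROR:    job '72710' cannot be started: (rc: 15044 errmsg: 'Resource temporarily unavailable'  hostlist: 'wild1')
09/28 09:00:28 ALERT:    job '72710' cannot run (deferring job for 3600 seconds)
"""

# Pattern for the resource manager start failure entry.
START_FAIL_RE = re.compile(
    r"job '(?P<job>\d+)' cannot be started: "
    r"\(rc: (?P<rc>\d+)\s+errmsg: '(?P<errmsg>[^']*)'\s+"
    r"hostlist: '(?P<hosts>[^']*)'\)"
)

# Pattern for the defer entry that follows the failed start attempts.
DEFER_RE = re.compile(
    r"job '(?P<job>\d+)' cannot run "
    r"\(deferring job for (?P<secs>\d+) seconds\)"
)


def summarize(log_text):
    """Return (start_failures, deferrals) found in the log text."""
    failures = []
    deferrals = {}
    for line in log_text.splitlines():
        m = START_FAIL_RE.search(line)
        if m:
            failures.append({
                "job": m["job"],
                "rc": int(m["rc"]),
                "errmsg": m["errmsg"],
                "hosts": m["hosts"],
            })
            continue
        m = DEFER_RE.search(line)
        if m:
            deferrals[m["job"]] = int(m["secs"])
    return failures, deferrals


if __name__ == "__main__":
    failures, deferrals = summarize(LOG)
    for failure in failures:
        print(failure)
    print(deferrals)
```

Run against the excerpt, this reports one failed start for job 72710 (rc 15044 on host wild1) and a 3600-second deferral for the same job.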

 

 

A further resubmission of the job also fails, this time with rc=15041.

This behavior is consistently reproducible.

 

Has anybody encountered this problem?

 
