[Mauiusers] Suspended jobs resume execution

Angel de Vicente angelv at iac.es
Thu Apr 20 05:03:43 MDT 2006


Ronny, 

I'm not sure what could be happening on your side, but in case it is useful,
I'll tell you how I got it partially working.

My cluster runs:

OS:      CentOS 4.3
MPI:     Open MPI 1.0.2
Torque:  Torque 2.0.0p8
Maui:    Maui 3.2.6p14

The Torque configuration is nothing special and the Maui configuration is
minimal (both are included below).

With this setup I can submit a preemptible job with QOS "low" (the default for
the batch class) with:

[angelv at boldo]$ qsub -l nodes=4:ppn=4 submit-greetings 

submit-greetings is just:

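#!/bin/bash
# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR" || exit 1
# The node file has one line per allocated slot, so its line count is the process count.
NP=$(wc -l < "$PBS_NODEFILE")
cat "$PBS_NODEFILE"
/usr/local/openmpi/openmpi-1.0.2/bin/mpiexec -machinefile "$PBS_NODEFILE" -n "$NP" ./greetings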
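
As a quick sanity check on what the script receives, the node file can be summarized before mpiexec is called. This is only a sketch: the function name summarize_nodefile is made up for illustration. It relies only on the node file listing one host name per allocated slot.

```shell
# Summarize a PBS node file: distinct hosts and total slots (non-empty lines).
# summarize_nodefile is a hypothetical helper name, not part of Torque.
summarize_nodefile() {
    file=$1
    slots=$(grep -c . "$file")                      # one line per MPI slot
    hosts=$(grep . "$file" | sort -u | grep -c .)   # distinct host names
    echo "hosts=$hosts slots=$slots"
}
# Usage inside the job script:
#   summarize_nodefile "$PBS_NODEFILE"
```

For a request of nodes=4:ppn=4 this should report 4 hosts and 16 slots, matching the 16 tasks that Maui looks for in the log further down.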

Then I can submit a preemptor job with:

[angelv at boldo]$ qsub -l nodes=4:ppn=4 submit-greetings -W x="QOS:hi"

Maui then reports that it suspended the first job successfully (at least
according to maui.log):

04/20 11:29:10 INFO:     16 feasible tasks found for job 305:0 in partition DEFAULT (16 Needed)
04/20 11:29:10 INFO:     inadequate feasible tasks found for job 305:0 (0 < 16)
04/20 11:29:10 INFO:     inadequate nodes found for job 305:0 (0 < 4)
04/20 11:29:10 MJobSelectPJobList(305,16,4,FJobList,PJList,PTCList,PNCList,PTL)
04/20 11:29:10 MRMJobSuspend(304,Msg,SC)
04/20 11:29:10 MPBSJobSuspend(304,BOLDO,Msg,SC)
04/20 11:29:10 INFO:     job '304' successfully suspended
04/20 11:29:10 MResDestroy(304)
04/20 11:29:10 MResChargeAllocation(304,2)
04/20 11:29:10 INFO:     attribute 'PREEMPTEE' set for job 304
04/20 11:29:10 ERROR:    invalid nodelist for job 305:0 (inadequate taskcount, 0 < 16)
04/20 11:29:10 ERROR:    cannot allocate nodes to job '305' in partition DEFAULT
04/20 11:29:10 MJobPReserve(305,DEFAULT,ResCount,ResCountRej)
04/20 11:29:10 MJobReserve(305,Priority)
04/20 11:29:10 INFO:     16 feasible tasks found for job 305:0 in partition DEFAULT (16 Needed)
04/20 11:29:10 INFO:     16 feasible tasks found for job 305:0 in partition DEFAULT (16 Needed)
04/20 11:29:10 INFO:     located resources for 16 tasks (16) in best partition DEFAULT for job 305 at time 00:00:01
04/20 11:29:10 INFO:     tasks located for job 305:  16 of 16 required (16 feasible)
04/20 11:29:10 MJobDistributeTasks(305,BOLDO,NodeList,TaskMap)
04/20 11:29:10 MResJCreate(305,MNodeList,00:00:01,Priority,Res)
04/20 11:29:10 INFO:     job '305' reserved 16 tasks (partition DEFAULT) to start in 00:00:01 on Thu Apr 20 11:29:11
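
To pick the suspended jobs out of a longer log, something like the following may help. It is a minimal sketch that assumes the log lines keep the "job 'ID' successfully suspended" wording seen above; the function name suspended_jobs is made up for illustration.

```shell
# Print the IDs of jobs that the log reports as suspended, one per line.
# Reads log text on stdin and matches lines such as:
#   INFO:     job '304' successfully suspended
# suspended_jobs is a hypothetical helper name.
suspended_jobs() {
    awk -F"'" '/successfully suspended/ { print $2 }'
}
# Usage (the log location depends on the installation):
#   suspended_jobs < maui.log
```

Note that this only shows what Maui believes it did, not whether the processes actually stopped on every node.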
 

But the suspension is not perfect. Looking at the load on the individual nodes,
I can see that on the node where the job started everything is fine (4
greetings processes stopped, state T, and 4 running), but on the other nodes
8 greetings processes are running, so the suspended job's processes there were
apparently never stopped.
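
A rough way to repeat this check on each node is to count process states from ps. The sketch below assumes a Linux procps ps (for the -C and -o stat= options); the function name count_states and the file nodes.txt are made up for illustration, and the loop over nodes assumes passwordless ssh.

```shell
# Count stopped (state T) versus all other processes.
# Reads ps state codes on stdin, one per line (e.g. T, R, S, Rl).
# count_states is a hypothetical helper name.
count_states() {
    awk 'NF { total++; if (substr($1, 1, 1) == "T") stopped++ }
         END { printf "stopped=%d other=%d\n", stopped, total - stopped }'
}
# Usage on one node:
#   ps -C greetings -o stat= | count_states
# Across nodes, with nodes.txt a saved copy of the node file:
#   for n in $(sort -u nodes.txt); do
#       echo "$n: $(ssh "$n" ps -C greetings -o stat= | count_states)"
#   done
```

On a node where suspension worked as intended, the stopped count should equal the number of processes the suspended job placed there (4 in my case).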

Also, the REMAINING time reported by Maui keeps decreasing for the suspended
job, which is not ideal.

Does anyone know whether these problems can be solved somehow?

Thanks,
Angel de Vicente

===============================

Torque configuration
--------------------
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.0.0p8

Qmgr: print queue batch
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True


Maui configuration
------------------

# maui.cfg 3.2.6p14
SERVERHOST            boldo
ADMIN1                root

RMCFG[BOLDO] TYPE=PBS
AMCFG[bank]  TYPE=NONE
RMPOLLINTERVAL        00:00:30
SERVERPORT            42559
SERVERMODE            NORMAL
LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://clusterresources.com/mauidocs/5.1jobprioritization.html
QUEUETIMEWEIGHT       1 
QOSWEIGHT             10

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST
NODEALLOCATIONPOLICY  MINRESOURCE

# QOS: http://clusterresources.com/mauidocs/7.3qos.html

QOSCFG[hi]      PRIORITY=100 QFLAGS=PREEMPTOR
QOSCFG[low]     PRIORITY=-1000 QFLAGS=PREEMPTEE
CLASSCFG[batch] QDEF=low QLIST=hi:low
PREEMPTPOLICY   SUSPEND


-- 
----------------------------------
http://www.iac.es/galeria/angelv/

PostDoc Software Support
Instituto de Astrofisica de Canarias


