[torqueusers] Problems with suspend/resume
Fedele Stabile
fedele at fis.unical.it
Wed Apr 13 04:24:23 MDT 2005
Hello,
I have torque-1.2.0p2 and maui-3.2.6p11 installed on my cluster.
In maui.cfg I have:
# maui.cfg 3.2p8
SERVERHOST linuxlab.fis.unical.it
ADMIN1 root
RMCFG[base] TYPE=PBS
RMCFG[base] SUSPENDSIG=20
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:00:30
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
PREEMPTIONPOLICY SUSPEND
QOSCFG[DEFAULT] QFLAGS=PREEMPTOR
QOSCFG[DEFAULT] QFLAGS=PREEMPTEE
When I run an MPI job and suspend it with qsig -s suspend, it is
suspended, but pbsnodes -a still shows the nodes in the job-exclusive state.
The output of checkjob is:
checking job 20
State: Suspended EState: Running
Creds: user:fedele group:users class:batch qos:DEFAULT
WallTime: 00:00:31 of 1:00:00
Suspended Wall Time: 00:01:32
SubmitTime: Tue Apr 12 11:08:29
(Time Queued Total: 1:00:08 Eligible: 00:58:04)
Total Tasks: 16
Req[0] TaskCount: 16 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
NodeCount: 16
Allocated Nodes:
[pc16:1][pc15:1][pc14:1][pc13:1]
[pc12:1][pc11:1][pc10:1][pc9:1]
[pc8:1][pc7:1][pc6:1][pc5:1]
[pc4:1][pc3:1][pc2:1][pc1:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE
Attr: PREEMPTEE
EState 'Running' does not match current state 'Suspended'
Reservation '20' (-00:02:04 -> 00:57:56 Duration: 1:00:00)
PE: 16.00 StartPriority: 58
cannot select job 20 for partition DEFAULT (non-idle expected state
'Running')
So the job is not actually suspended. Can you help me?
Thank you, Fedele
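
The mismatch that checkjob reports above (State 'Suspended' versus EState 'Running') can also be checked for in a script. The following is only a minimal sketch: the helper names parse_states and states_disagree are hypothetical, the sample text is copied from the output above, and it assumes the "State: ... EState: ..." line always has the layout shown in this post.

```python
import re


def parse_states(checkjob_output):
    """Return (state, expected_state) from checkjob text, or None if the line is absent.

    Assumes the 'State: X EState: Y' layout seen in the output quoted in this post.
    """
    match = re.search(
        r"^[ \t]*State:\s*(\S+)\s+EState:\s*(\S+)",
        checkjob_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


def states_disagree(checkjob_output):
    """True when both states are present and differ from each other."""
    states = parse_states(checkjob_output)
    return states is not None and states[0] != states[1]


# Sample text copied from the checkjob output in this post.
SAMPLE = """checking job 20
State: Suspended EState: Running
Creds: user:fedele group:users class:batch qos:DEFAULT
"""

if __name__ == "__main__":
    print(parse_states(SAMPLE))
    print(states_disagree(SAMPLE))
```

To use it on a live system, the text passed in would be the captured output of checkjob for the job in question. The sketch only reports the mismatch; it does not explain why the nodes remain job-exclusive.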