[Mauiusers] Re: [torqueusers] Problems with suspend/resume

Jerome jerome at ibt.unam.mx
Wed Apr 13 09:30:07 MDT 2005


Fedele Stabile wrote:
> Hello,
> my cluster has torque-1.2.0p2 and maui-3.2.6p11 installed.
> In maui.cfg I have:
> # maui.cfg 3.2p8
> SERVERHOST            linuxlab.fis.unical.it
> ADMIN1                root
> RMCFG[base]  TYPE=PBS
> RMCFG[base] SUSPENDSIG=20
> AMCFG[bank]  TYPE=NONE
> RMPOLLINTERVAL        00:00:30 
> SERVERPORT            42559
> SERVERMODE            NORMAL
> LOGFILE               maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              3
> QUEUETIMEWEIGHT       1
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
> NODEALLOCATIONPOLICY  MINRESOURCE
> PREEMPTIONPOLICY SUSPEND
> QOSCFG[DEFAULT]  QFLAGS=PREEMPTOR
> QOSCFG[DEFAULT]  QFLAGS=PREEMPTEE
> 
> When I suspend a running MPI job with qsig -s suspend, it is
> suspended, but pbsnodes -a still shows the nodes in state job-exclusive.
> The output of checkjob is:
> checking job 20
>  
> State: Suspended  EState: Running
> Creds:  user:fedele  group:users  class:batch  qos:DEFAULT
> WallTime: 00:00:31 of 1:00:00
> Suspended Wall Time: 00:01:32
> SubmitTime: Tue Apr 12 11:08:29
>   (Time Queued  Total: 1:00:08  Eligible: 00:58:04)
>  
> Total Tasks: 16
>  
> Req[0]  TaskCount: 16  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> NodeCount: 16
> Allocated Nodes:
> [pc16:1][pc15:1][pc14:1][pc13:1]
> [pc12:1][pc11:1][pc10:1][pc9:1]
> [pc8:1][pc7:1][pc6:1][pc5:1]
> [pc4:1][pc3:1][pc2:1][pc1:1]
>  
>  
>  
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE PREEMPTEE
> Attr:        PREEMPTEE
>  
> EState 'Running' does not match current state 'Suspended'
> Reservation '20' (-00:02:04 -> 00:57:56  Duration: 1:00:00)
> PE:  16.00  StartPriority:  58
> cannot select job 20 for partition DEFAULT (non-idle expected state
> 'Running')
>  
> So the job is not really suspended. Can you help me?
> Thank you, Fedele
> 

This happens because Torque does not release the nodes held by
suspended jobs.
See this message:
http://www.supercluster.org/pipermail/mauiusers/2004-July/001284.html

It proposes a patch that works well in my case.
Hope that helps.
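
To spot this condition on other jobs, one option is to compare the
State and EState fields that checkjob prints. The following is only a
minimal sketch in Python: the helper name find_state_mismatch is
hypothetical, and it assumes the output keeps the
"State: ...  EState: ..." layout shown in the quoted message.

```python
import re

# Matches a line such as "State: Suspended  EState: Running".
# Assumption: checkjob keeps both fields on one line, as in the quoted output.
STATE_RE = re.compile(r"^\s*State:\s*(\S+)\s+EState:\s*(\S+)", re.MULTILINE)


def find_state_mismatch(checkjob_output):
    """Return (state, estate) if the two fields differ, else None.

    Hypothetical helper for illustration; it is not part of Maui or Torque.
    """
    match = STATE_RE.search(checkjob_output)
    if match is None:
        return None  # no State/EState line found in the text
    state, estate = match.groups()
    return (state, estate) if state != estate else None


# Sample text copied from the checkjob output quoted above.
sample = """checking job 20

State: Suspended  EState: Running
Creds:  user:fedele  group:users  class:batch  qos:DEFAULT
"""

print(find_state_mismatch(sample))
```

The text of a real run could be captured and passed to the same
function instead of the sample string.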

-- 
-- Jérôme
In the past, when a girl was embarrassed,
she blushed.
Today, when a girl blushes,
she is embarrassed.
	(Mme Simone)

