[Mauiusers] advance standing reservation

Bill Wichser bill at Princeton.EDU
Mon Mar 27 14:08:27 MST 2006


So I've tried everything I thought I knew and I still cannot make 
this thing work.  Jobs just defer forever.  I am now effectively the 
scheduler, running qrun by hand whenever I see a job waiting in the 
quad queue.

I've removed the quad queue entirely and then reentered it, hoping that 
somewhere there was just some typo.  I could sure use some help on this 
one as I've just about scratched my head until it's bleeding!

Bill



Bill Wichser wrote:
> Environment
> -----------
> maui-3.2.6p13
> torque-1.1.0p6
> linux cluster
> 
> I had this working.  Or so I thought.  But after a pbs-server reboot and 
> a maui reboot, jobs just defer.
> 
> I have two quad-processor nodes that I want to restrict to jobs 
> submitted with #PBS -q quad; no other jobs should run on them.
> 
> -------------------------------------------------
> In Torque:
> create queue quad
> set queue quad queue_type = Execution
> set queue quad acl_hosts = node076+node077
> set queue quad resources_max.nodect = 2
> set queue quad enabled = True
> set queue quad started = True
> ---------------------------------------------------
> In maui:
> SRCFG[quad] HOSTLIST=node076,node077
> SRCFG[quad] FLAGS=BYNAME
> SRCFG[quad] PERIOD=INFINITY
> SRCFG[quad] CLASSLIST=quad
> 
> CLASSCFG[quad]          PRIORITY=0
> CLASSCFG[quad]          FLAGS=ADVRES:quad.0.0
> ------------------------------------------------------
> 
> diagnose -r
> 
> quad.0.0                   User DEF   -00:13:26    INFINITY     INFINITY    2    2    8
>     Flags: STANDINGRES BYNAME
>     ACL: RES==quad.0= CLASS==quad+
>     CL:  RES==quad.0
>     Task Resources: PROCS: [ALL]
>     Attributes (HostList='node076 node077')
>     Active PH: 0.00/1.79 (0.00%)
>     SRAttributes (TaskCount: 0  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)
> ---------------------------------------------------------------------------
> 
> So the reservation is there and appears active.  But when I run 
> "checknode node077", the Reservations section shows something that 
> doesn't seem correct.
> 
> ---------------------------------------------------------------------------- 
> 
> checking node node077
> 
> State:      Idle  (in current state for 00:13:26)
> Configured Resources: PROCS: 4  MEM: 15G  SWAP: 16G  DISK: 1M
> Utilized   Resources: [NONE]
> Dedicated  Resources: [NONE]
> Opsys:       DEFAULT  Arch:       linux
> Speed:      1.00  Load:       0.000
> Network:    [DEFAULT]
> Features:   [quad]
> Attributes: [Batch]
> Classes:    [short 4:4][long 4:4][verylong 4:4][quad 4:4][default 4:4][single 4:4]
> 
> Total Time:   INFINITY  Up:   INFINITY (81.54%)  Active:   INFINITY (37.11%)
> 
> Reservations:
>   User 'quad.0.0'(x1)  -00:13:26 ->   INFINITY (  INFINITY)
>     Blocked Resources at -00:13:26   Procs: 4/4 (100.00%)
> ------------------------------------------------------------------------
> It is that "Blocked Resources" line that looks wrong to me.
> So I submit a job specifying this quad queue and it immediately gets 
> placed into a deferred state in the blocked list.
> 
> --------------------------------------------------------------------------
> checking job 24640
> 
> State: Idle  EState: Deferred
> Creds:  user:bill  group:bill  class:quad  qos:DEFAULT
> WallTime: 00:00:00 of 1:12:00:00
> SubmitTime: Fri Mar 10 09:32:35
>   (Time Queued  Total: 3:08:51  Eligible: 00:05:27)
> 
> Total Tasks: 4
> 
> Req[0]  TaskCount: 4  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> job is deferred.  Reason:  NoResources  (cannot create reservation for 
> job '24640' (intital reservation attempt)
> )
> Holds:    Defer  (hold reason:  NoResources)
> PE:  4.00  StartPriority:  3087
> cannot select job 24640 for partition DEFAULT (job hold active)
> -----------------------------------------------------------------------
> 
> And the Maui logs show:
> 
> 03/10 12:46:06 INFO:     node node077 can provide resources for job 24640:0
> 03/10 12:46:06 MLocalJobCheckNRes(24640,node077,2140000000)
> 03/10 12:46:06 INFO:     8 feasible tasks found for job 24640:0 in partition DEFAULT (4 Needed)
> 03/10 12:46:06 MJobGetSNRange(24640,0,node076,(4 at 00:00:00),256,Affinity,Type,ARange,BRes)
> 03/10 12:46:06 INFO:     attempting to get resources for 24640 4 * (P: 1  M: 0  S: 0  D: 0)
> 03/10 12:46:06 MResCheckJAccess(24612,24640,129600,Same,Affinity)
> 03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
> 03/10 12:46:06 MResCheckJAccess(24612,24640,129600,Same,Affinity)
> 03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
> 03/10 12:46:06 INFO:     ARange[0] too short for job 24640 (MR: 1 < W: 129600):  removing range
> 03/10 12:46:06 INFO:     node node076 unavailable for job 24640 at 00:00:00
> 03/10 12:46:06 INFO:     no reservation time found for job 24640 on node node076 at 00:00:00
> 03/10 12:46:06 MJobGetSNRange(24640,0,node077,(4 at 00:00:00),256,Affinity,Type,ARange,BRes)
> 03/10 12:46:06 INFO:     attempting to get resources for 24640 4 * (P: 1  M: 0  S: 0  D: 0)
> 03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
> 03/10 12:46:06 MResCheckJAccess(quad.0.0,24640,129600,Same,Affinity)
> 03/10 12:46:06 INFO:     ARange[0] too short for job 24640 (MR: 1 < W: 129600):  removing range
> 03/10 12:46:06 INFO:     node node077 unavailable for job 24640 at 00:00:00
> 03/10 12:46:06 INFO:     no reservation time found for job 24640 on node node077 at 00:00:00
> 03/10 12:46:06 MJobSelectFRL(24640,G,1,RCount)
> 03/10 12:46:06 ALERT:    job 24640 cannot run in any partition
> 03/10 12:46:06 ALERT:    cannot create new reservation for job 24640 (shape[1] 4)
> 03/10 12:46:06 ALERT:    cannot create new reservation for job 24640
> 03/10 12:46:06 MJobSetHold(24640,16,00:05:00,NoResources,cannot create reservation for job '24640' (intital reservation attempt)
> 03/10 12:46:06 ALERT:    job '24640' cannot run (deferring job for 300 seconds)
> ---------------------------------------------------------------------------- 
> 
> 
> I must be missing something here but I've reread the documentation and 
> find nothing.  I'm not sure how to further debug.  Can anyone provide me 
> with a further clue as to what might be missing?
> 
> Thanks,
> Bill
> 
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
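
One way to read the "ARange[0] too short" lines in the quoted log: W: 129600
looks like the job's requested walltime in seconds, since the 1:12:00:00 limit
shown by checkjob is 36 hours, and MR: 1 would then be the length of the range
Maui considered available.  That reading of MR and W is an assumption drawn
from the log text itself, not from the Maui documentation.  A minimal Python
sketch of the arithmetic follows; the helper names walltime_to_seconds and
parse_arange are hypothetical and are not part of Maui or Torque.

```python
import re


def walltime_to_seconds(text):
    """Convert a [[D:]HH:]MM:SS string such as '1:12:00:00' to seconds."""
    parts = [int(p) for p in text.split(":")]
    # Pad on the left so the list is always [days, hours, minutes, seconds].
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_arange(line):
    """Extract (MR, W) from an 'ARange[...] too short' log line, else None."""
    match = re.search(r"\(MR:\s*(\d+)\s*<\s*W:\s*(\d+)\)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Log line copied from the quoted message, with the mail wrap rejoined.
LOG_LINE = ("03/10 12:46:06 INFO:     ARange[0] too short for job 24640 "
            "(MR: 1 < W: 129600):  removing range")

mr, w = parse_arange(LOG_LINE)
requested = walltime_to_seconds("1:12:00:00")  # WallTime limit from checkjob

# Assumption: W is the requested walltime in seconds; this checks they agree.
print(mr, w, requested, w == requested)
```

If that reading holds, both nodes are rejected because the usable window is
one second long while the job asks for 36 hours.  That would fit with the
"Blocked Resources ... Procs: 4/4" line from checknode, though nothing in the
message confirms the cause.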

