[Mauiusers] Job not running on secondary partition

Hyrum Carroll hyrum at clusterresources.com
Fri Dec 3 11:15:27 MST 2004


Matt,
I am trying to recreate the problem that you are seeing.  I used the
configuration that you specified with maui-3.2.6p9 and torque-1.1.0p5.  I
set up a system with 1 class and 2 nodes, one in partition STAFF and the
other in partition GENERAL.  I ran job A with a walltime of 1:00:00 and
job B with a walltime of 30:00.  Job A started immediately in STAFF and B
started immediately in GENERAL.  I then submitted job C, which
received a reservation on the node in partition GENERAL.  After I
canceled job A, job C migrated over to the node in partition STAFF.
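
For reference, the test setup described above would look roughly like
the following maui.cfg fragment.  This is only a sketch: the node names
nodeA and nodeB are hypothetical, and the class and policy lines are
copied from the configuration quoted below on the assumption that they
were used unchanged.

```
# Policy and class lines taken from the quoted configuration
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST
RESERVATIONDEPTH      1
CLASSCFG[staff]       PLIST=STAFF:GENERAL PDEF=STAFF

# Hypothetical node names; one node per partition
NODECFG[nodeA]        PARTITION=STAFF
NODECFG[nodeB]        PARTITION=GENERAL
```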

Matt, please send me more information (e.g., a loglevel 7 log and any
other settings) to help track down this problem.
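
A loglevel 7 log can usually be produced by raising the log level in
maui.cfg.  The line below is a sketch that assumes the standard
LOGLEVEL parameter and that the scheduler is restarted afterwards so
the new value takes effect; where the log file ends up depends on the
local installation.

```
# Verbose logging for troubleshooting; lower it again afterwards,
# since high log levels produce large log files
LOGLEVEL              7
```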

Hyrum Carroll
Cluster Resources, Inc.


On Thu, 2004-12-02 at 11:40, Matthew Britt wrote:
> 	From: 	  msbritt at umich.edu
> 	Subject: 	Job not running on secondary partition
> 	Date: 	December 1, 2004 11:41:58 PM EST
> 	To: 	  mauiusers at supercluster.org
> 
> We have a problem with jobs not starting up in a secondary partition, 
> using maui-3.2.6p9 and PBSPro 5.4.0.  After the primary partition is 
> full, the next job (or set of jobs; this seems based on RESERVATIONDEPTH) 
> will have a reservation created for it, even though there are plenty of 
> resources available in the secondary partition.  Any subsequent jobs 
> will run in the secondary partition without delay.  We've tested this 
> with RESERVATIONDEPTH set to 0, 1 and 2, which results in 0, 1 or 2 
> jobs being scheduled to run via reservation, rather than running 
> immediately.
> 
> Is there a method/configuration which will automatically run the job in 
> the secondary partition?
> 
> Here are the configs:
> 
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
> RESERVATIONDEPTH       1
> 
> # Try to make jobs run on one processor type, if possible
> NODEALLOCATIONPOLICY  MINRESOURCE
> 
> SYSCFG                  PLIST=
> 
> CLASSCFG[staff] PLIST=STAFF:GENERAL PDEF=STAFF
> 
> 
> NODECFG[node001m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
> NODECFG[node002m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
> NODECFG[node003m] MAXJOB=1 PROCSPEED=1600 PARTITION=STAFF   # This is the only node in the STAFF partition
> NODECFG[node004m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
> NODECFG[node005m] MAXJOB=1 PROCSPEED=1600 PARTITION=GENERAL
> NODECFG[node006m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
> 
> 
> 
> 
> Here's the checkjob output:
> *******first job - runs in primary partition *********
> State: Running
> Creds:  user:msbritt  group:users  class:staff  qos:DEFAULT
> WallTime: 00:00:02 of 00:10:00
> SubmitTime: Wed Dec  1 23:30:06
>    (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
> 
> StartTime: Wed Dec  1 23:30:07
> Total Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: STAFF
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> NodeCount: 1
> Allocated Nodes:
> [node003m:1]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [GENERAL][STAFF]
> Flags:       RESTARTABLE
> 
> Reservation '129683' (-00:00:02 -> 00:09:58  Duration: 00:10:00)
> PE:  1.00  StartPriority:  1
> 
> **************second job - the one that gets "stuck" ***************
> checking job 129684
> 
> State: Idle
> Creds:  user:msbritt  group:users  class:staff  qos:DEFAULT
> WallTime: 00:00:00 of 00:10:00
> SubmitTime: Wed Dec  1 23:30:07
>    (Time Queued  Total: 00:01:02  Eligible: 00:01:02)
> 
> Total Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> NodeCount: 1
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 1  StartCount: 0
> PartitionMask: [GENERAL][STAFF]
> Flags:       RESTARTABLE
> 
> Reservation '129684' (00:00:00 -> 00:10:00  Duration: 00:10:00)
> PE:  1.00  StartPriority:  1
> job can run in partition GENERAL (9 procs available.  1 procs required)
> job cannot run in partition STAFF (insufficient idle procs available: 0 < 1)
> 
> **************third job - runs in the secondary partition*******************
> checking job 129685
> 
> State: Running
> Creds:  user:msbritt  group:users  class:staff  qos:DEFAULT
> WallTime: 00:01:41 of 00:10:00
> SubmitTime: Wed Dec  1 23:30:08
>    (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
> 
> StartTime: Wed Dec  1 23:30:09
> Total Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: GENERAL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> NodeCount: 1
> Allocated Nodes:
> [node065m:1]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [GENERAL][STAFF]
> Flags:       BACKFILL RESTARTABLE
> 
> Reservation '129685' (-00:01:31 -> 00:08:29  Duration: 00:10:00)
> PE:  1.00  StartPriority:  1
> 
> 
> Thanks for any help!
> 
>   -  matt
> 


