[Mauiusers] Job not running on secondary partition
Hyrum Carroll
hyrum at clusterresources.com
Fri Dec 3 11:15:27 MST 2004
Matt,
I am trying to recreate the problem that you are seeing. I used the
configuration that you specified with maui-3.2.6p9 and torque-1.1.0p5. I
set up a system with 1 class and 2 nodes, one in partition STAFF and the
other in partition GENERAL. I ran job A with a walltime of 1:00:00 and
job B with a walltime of 30:00. Job A started immediately in STAFF and B
started immediately in GENERAL. Furthermore, I started job C and it
received a reservation on the node in partition GENERAL. After I
canceled job A, job C migrated over to the node in partition STAFF.
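For reference, a two-node test setup like the one described above could
be sketched roughly as follows, using the same syntax as the quoted
configuration. The node names (testnode01, testnode02) are placeholders
and not the names used in the actual test. The class name and the
scheduling policies are copied from the quoted configuration; the exact
maui.cfg used for the test is not shown in this message.

    # Hypothetical node names; one node in each partition
    NODECFG[testnode01] PARTITION=STAFF
    NODECFG[testnode02] PARTITION=GENERAL

    # One class that may use both partitions and defaults to STAFF
    CLASSCFG[staff] PLIST=STAFF:GENERAL PDEF=STAFF

    # Policies as given in the quoted configuration
    BACKFILLPOLICY FIRSTFIT
    RESERVATIONPOLICY CURRENTHIGHEST
    RESERVATIONDEPTH 1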
Matt, please send me more information (e.g., a LOGLEVEL 7 log, other
settings, etc.) to help track down this problem.
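A log at that level can usually be obtained by raising LOGLEVEL in
maui.cfg and restarting the scheduler. The fragment below is only a
sketch: the log file name is an assumed example, and your existing
LOGFILE setting, if any, can stay as it is.

    # Verbose scheduler logging while reproducing the problem
    LOGLEVEL 7
    # Assumed example file name; keep your own LOGFILE setting if one exists
    LOGFILE maui.log

Level 7 logs grow quickly, so it is worth lowering LOGLEVEL again once
the stuck job has been captured.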
Hyrum Carroll
Cluster Resources, Inc.
On Thu, 2004-12-02 at 11:40, Matthew Britt wrote:
> From: msbritt at umich.edu
> Subject: Job not running on secondary partition
> Date: December 1, 2004 11:41:58 PM EST
> To: mauiusers at supercluster.org
>
> We have a problem with jobs not starting up in a secondary partition,
> using maui-3.2.6p9 and PBSPro 5.4.0. After the primary partition is
> full, the next job (or set of jobs - seems based on RESERVATIONDEPTH)
> will have a reservation created for it, even though there are plenty of
> resources available in the secondary partition. Any subsequent jobs
> will run in the secondary partition w/o delay. We've tested this with
> RESERVATIONDEPTH set to 0, 1 and 2, which results in 0, 1 or 2 jobs
> being scheduled to run via reservation, rather than running
> immediately.
>
> Is there a method/configuration which will automatically run the job in
> the secondary partition?
>
> Here are the configs:
>
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
> RESERVATIONDEPTH 1
>
> # Try to make jobs run on one processor type, if possible
> NODEALLOCATIONPOLICY MINRESOURCE
>
> SYSCFG PLIST=
>
> CLASSCFG[staff] PLIST=STAFF:GENERAL PDEF=STAFF
>
>
> NODECFG[node001m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
> NODECFG[node002m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
> NODECFG[node003m] MAXJOB=1 PROCSPEED=1600 PARTITION=STAFF # This is the only node in the STAFF partition
> NODECFG[node004m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
> NODECFG[node005m] MAXJOB=1 PROCSPEED=1600 PARTITION=GENERAL
> NODECFG[node006m] MAXJOB=1 PROCSPEED=2600 PARTITION=GENERAL
>
>
>
>
> Here's the checkjob output:
> *******first job - runs in primary partition *********
> State: Running
> Creds: user:msbritt group:users class:staff qos:DEFAULT
> WallTime: 00:00:02 of 00:10:00
> SubmitTime: Wed Dec 1 23:30:06
> (Time Queued Total: 00:00:01 Eligible: 00:00:01)
>
> StartTime: Wed Dec 1 23:30:07
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: STAFF
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> NodeCount: 1
> Allocated Nodes:
> [node003m:1]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [GENERAL][STAFF]
> Flags: RESTARTABLE
>
> Reservation '129683' (-00:00:02 -> 00:09:58 Duration: 00:10:00)
> PE: 1.00 StartPriority: 1
>
> **************second job - the one that gets "stuck" ***************
> checking job 129684
>
> State: Idle
> Creds: user:msbritt group:users class:staff qos:DEFAULT
> WallTime: 00:00:00 of 00:10:00
> SubmitTime: Wed Dec 1 23:30:07
> (Time Queued Total: 00:01:02 Eligible: 00:01:02)
>
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> NodeCount: 1
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 1 StartCount: 0
> PartitionMask: [GENERAL][STAFF]
> Flags: RESTARTABLE
>
> Reservation '129684' (00:00:00 -> 00:10:00 Duration: 00:10:00)
> PE: 1.00 StartPriority: 1
> job can run in partition GENERAL (9 procs available. 1 procs required)
> job cannot run in partition STAFF (insufficient idle procs available: 0 < 1)
>
> **************third job - runs in the secondary partition*******************
> checking job 129685
>
> State: Running
> Creds: user:msbritt group:users class:staff qos:DEFAULT
> WallTime: 00:01:41 of 00:10:00
> SubmitTime: Wed Dec 1 23:30:08
> (Time Queued Total: 00:00:01 Eligible: 00:00:01)
>
> StartTime: Wed Dec 1 23:30:09
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: GENERAL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> NodeCount: 1
> Allocated Nodes:
> [node065m:1]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [GENERAL][STAFF]
> Flags: BACKFILL RESTARTABLE
>
> Reservation '129685' (-00:01:31 -> 00:08:29 Duration: 00:10:00)
> PE: 1.00 StartPriority: 1
>
>
> Thanks for any help!
>
> - matt
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://supercluster.org/mailman/listinfo/mauiusers