[Mauiusers] Job stuck in queue - needed procs are available

Thomas Dargel td at chemie.hu-berlin.de
Fri Mar 3 03:36:39 MST 2006


Hi all,

I have three types of queues defined on my cluster (x86-64, SLES9SP3,
torque120p6, maui326p14-snap1129921819): 'cpu-2' ==> dual Opterons, 'cpu-8'
==> octuple Opteron, and 'dualcore' ==> dual-core dual Opteron.
Everything runs fine until the 'cpu-2' queue is full and some 'cpu-2' jobs
are queued. In this situation no pending 'dualcore' job starts, even though
there is an idle CPU on that machine.

Does anybody have an idea what I configured wrong? Any help is appreciated.
Thanks in advance,

 Thomas Dargel.

Attached: outputs from 'checkjob 6010' (the queued job), 
          'checknode node24' (the dualcore machine)
          and a snippet of the logfile (LOGLEVEL 6).
          I will provide more information, if needed.
-- 
--------------------------------------------------------------------------------
 Thomas Dargel            Raum: 3'325           Tel.: +(49)30 2093-7143/4
    Humboldt-Universitaet zu Berlin             Fax.: +(49)30 2093-7136
    Institut fuer Chemie
    AG Quantenchemie, Prof. Sauer
    Brook-Taylor-Str. 2                         Mail: td AT chemie.hu-berlin.de
  D-12489 Berlin - Adlershof
--------------------------------------------------------------------------------
-------------- next part --------------
03/03 11:09:28 MReqCheckResourceMatch(BFWindow,0,node24,RIndex)
03/03 11:09:28 INFO:     node node24 can provide resources for job BFWindow:0
03/03 11:09:28 MJobCheckNRes(BFWindow,node24,RQ[0],00:00:00,TCAvail,1.000,RIndex,Affinity,FeasCheck)
03/03 11:09:28 MReqCheckResourceMatch(BFWindow,0,node24,RIndex)
03/03 11:09:28 INFO:     node node24 can provide resources for job BFWindow:0
03/03 11:09:28 MJobCheckNStartTime(BFWindow,RQ,node24,00:00:00,TasksAllowed,1.000000,RIndex,Affinity)
03/03 11:09:28 MRECheck(node24,MJobGetSNRange-Start,FORCE)
03/03 11:09:28 INFO:     resources available at time -1:22:24:30 during 6006 start
03/03 11:09:28 INFO:     adjusting 'preactive' ARange[0] taskcount from 3 to 2
03/03 11:09:28 INFO:     adjusting 'preactive' ARange[0] taskcount from 2 to 1
03/03 11:09:28 INFO:     ARange[1] (1149853497 -> 1149878887)x2 too late for job BFWindow by 8472929
03/03 11:09:28 INFO:     ARange[2] (1149878887 -> 1149944218)x3 too late for job BFWindow by 8498319
03/03 11:09:28 INFO:     ARange[3] (1149944218 -> 2140000000)x4 too late for job BFWindow by 8563650
03/03 11:09:28 INFO:     node node24 supports 1 task  of job BFWindow:0 for 98:08:38:39 at 00:00:00
03/03 11:09:28 MRECheck(node24,MJobGetSNRange-Start,FORCE)
03/03 11:09:28 INFO:     resources available at time -1:22:24:30 during 6006 start
03/03 11:09:28 INFO:     adjusting 'preactive' ARange[0] taskcount from 3 to 2
03/03 11:09:28 INFO:     adjusting 'preactive' ARange[0] taskcount from 2 to 1
03/03 11:09:28 INFO:     ARange[1] (1149853497 -> 1149878887)x2 too late for job BFWindow by 8472929
03/03 11:09:28 INFO:     ARange[2] (1149878887 -> 1149944218)x3 too late for job BFWindow by 8498319
03/03 11:09:28 INFO:     ARange[3] (1149944218 -> 2140000000)x4 too late for job BFWindow by 8563650
03/03 11:09:28 INFO:     node node24 supports 1 task  of job BFWindow:0 for 98:08:38:39 at 00:00:00
03/03 11:09:28 INFO:     backfill window:  time:   INFINITY  nodes:   0  tasks:   0  mintime: 8498319 (idle nodes: 0)
03/03 11:09:28 MPolicyAdjustUsage(NULL,6074,NULL,idle,PU,[ALL],-1,NULL)
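
The "too late ... by N" values in the log above are easier to compare with the reservation times when written as days:hours:minutes:seconds. The following is only a small sketch for that conversion. It assumes that N is a count of seconds and that the colon-separated times in the Maui output are d:hh:mm:ss; neither is stated in the log itself. The helper name format_duration is made up for this illustration.

```python
def format_duration(seconds):
    """Format a number of seconds as d:hh:mm:ss (assumed Maui-style layout)."""
    days, rem = divmod(seconds, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}:{hours:02d}:{minutes:02d}:{secs:02d}"


# The three "too late by" values copied from the log excerpt.
for late_by in (8472929, 8498319, 8563650):
    print(late_by, "->", format_duration(late_by))
```

Under these assumptions the second value, 8498319, comes out as 98:08:38:39, the same figure as in the "supports 1 task ... for 98:08:38:39" line and as the mintime of the backfill window.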

-------------- next part --------------


checking node node24

State:   Running  (in current state for 00:00:00)
Configured Resources: PROCS: 4  MEM: 7968M  SWAP: 7968M  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: PROCS: 3  MEM: 5970M
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       2.990
Location:   Partition: DEFAULT  Frame/Slot:  1/1
Network:    [DEFAULT]
Features:   [dc]
Attributes: [Batch]
Classes:    [cpu-2 4:4][cpu-8 4:4][mixpipe 4:4][dualcore 1:4]

Total Time: 76:02:55:49  Up: 69:06:05:46 (90.98%)  Active: 45:07:30:38 (59.53%)

Reservations:
  Job '6008'(x1)  -1:15:25:51 -> 98:08:34:08 (99:23:59:59)
  Job '6006'(x1)  -1:22:29:01 -> 98:01:30:58 (99:23:59:59)
  Job '6009'(x1)  -21:17:00 -> 99:02:42:59 (99:23:59:59)
JobList:  6006,6008,6009
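
For a quick cross-check of the "idle cpu" claim, the free resources on node24 can be computed from the Configured and Dedicated lines above. This is a minimal sketch that only parses the two lines as pasted here; the helper names are hypothetical, and it assumes that free = configured - dedicated is a fair reading of the checknode output.

```python
import re

# The two resource lines copied from the checknode output above.
CHECKNODE_SNIPPET = """\
Configured Resources: PROCS: 4  MEM: 7968M  SWAP: 7968M  DISK: 1M
Dedicated  Resources: PROCS: 3  MEM: 5970M
"""


def resource(text, label, key):
    """Return the integer after `key:` on the `label ... Resources:` line, or 0."""
    pattern = rf"^{label}\s+Resources:.*?\b{key}:\s*(\d+)"
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def free(text, key):
    """Configured minus dedicated amount for one resource type."""
    return resource(text, "Configured", key) - resource(text, "Dedicated", key)


print("free procs:", free(CHECKNODE_SNIPPET, "PROCS"))
print("free mem (M):", free(CHECKNODE_SNIPPET, "MEM"))
```

By this reading the node has 1 processor and 1998M of memory not dedicated to a job, which would cover the PROCS: 1 and MEM: 1990M per task that job 6010 requests.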

-------------- next part --------------


checking job 6010 (RM job '6010.cnode01.mauicluster')

State: Idle
Creds:  user:jd  group:qc  class:dualcore  qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Wed Mar  1 10:00:06
  (Time Queued  Total: 2:01:12:53  Eligible: 2:01:12:53)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [dc]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1  MEM: 1990M
NodeAccess: SHARED
TasksPerNode: 1  NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 13  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  1.00  StartPriority:  97387
job can run in partition DEFAULT (1 procs available.  1 procs required)


