[Mauiusers] Job sat in idle for a long time even though it appeared there were resources available

Rob Lines rlinesseagate at gmail.com
Tue Jun 24 10:14:19 MDT 2008


Here is a bit from the maui.log file of a scheduling run where it did not
start:

06/23 10:22:55 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
06/23 10:22:55 INFO:     total jobs selected in partition ALL: 1/1
06/23 10:22:55 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
06/23 10:22:55 INFO:     total jobs selected in partition DEFAULT: 1/1
06/23 10:22:55 MQueueScheduleIJobs(Q,DEFAULT)
06/23 10:22:55 INFO:     180 feasible tasks found for job 42433:0 in partition DEFAULT (20 Needed)
06/23 10:22:55 ALERT:    inadequate tasks to allocate to job 42433:0 (4 < 20)
06/23 10:22:55 ERROR:    cannot allocate nodes to job '42433' in partition DEFAULT
06/23 10:22:55 MJobPReserve(42433,DEFAULT,ResCount,ResCountRej)
06/23 10:22:55 MJobReserve(42433,Priority)
06/23 10:22:55 INFO:     180 feasible tasks found for job 42433:0 in partition DEFAULT (20 Needed)
06/23 10:22:55 INFO:     180 feasible tasks found for job 42433:0 in partition DEFAULT (20 Needed)
06/23 10:22:55 INFO:     located resources for 20 tasks (140) in best partition DEFAULT for job 42433 at time 00:00:01
06/23 10:22:55 INFO:     tasks located for job 42433:  20 of 20 required (140 feasible)
06/23 10:22:55 MJobDistributeTasks(42433,SCFS.*FQDN*,NodeList,TaskMap)
06/23 10:22:55 MResJCreate(42433,MNodeList,00:00:01,Priority,Res)
06/23 10:22:55 INFO:     job '42433' reserved 20 tasks (partition DEFAULT) to start in 00:00:01 on Mon Jun 23 10:22:56
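
To track how often this shortfall occurs across iterations, the ALERT lines can be pulled out of maui.log with a short script. This is a minimal sketch, not a Maui tool: the function name find_task_shortfalls is hypothetical, and reading the two numbers in "(4 < 20)" as tasks available versus tasks required is an assumption based on the wording of the message above.

```python
import re

# Matches the "inadequate tasks" alert as it appears in the excerpt above.
# Assumption: the first number is tasks available, the second tasks required.
ALERT_RE = re.compile(
    r"ALERT:\s+inadequate tasks to allocate to job (\S+) \((\d+) < (\d+)\)"
)


def find_task_shortfalls(lines):
    """Yield (job, available, needed) for each matching alert line."""
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            yield match.group(1), int(match.group(2)), int(match.group(3))


# Two lines copied from the excerpt above (unwrapped to one line each).
sample = [
    "06/23 10:22:55 INFO:     180 feasible tasks found for job 42433:0 in partition DEFAULT (20 Needed)",
    "06/23 10:22:55 ALERT:    inadequate tasks to allocate to job 42433:0 (4 < 20)",
]

shortfalls = list(find_task_shortfalls(sample))
print(shortfalls)
```

For a real log, the same function can be fed an open file object in place of the sample list.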



Here is the one where it ran 2 minutes later (it had been submitted almost
24 hours before):

06/23 10:24:05 MStatClearUsage([NONE],Idle)
06/23 10:24:05 INFO:     total jobs selected (ALL): 1/12 [State: 11]
06/23 10:24:05 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
06/23 10:24:05 INFO:     total jobs selected in partition ALL: 1/1
06/23 10:24:05 MQueueScheduleRJobs(Q)
06/23 10:24:05 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
06/23 10:24:05 INFO:     total jobs selected in partition ALL: 1/1
06/23 10:24:05 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
06/23 10:24:05 INFO:     total jobs selected in partition DEFAULT: 1/1
06/23 10:24:05 MQueueScheduleIJobs(Q,DEFAULT)
06/23 10:24:05 INFO:     180 feasible tasks found for job 42433:0 in partition DEFAULT (20 Needed)
06/23 10:24:05 INFO:     tasks located for job 42433:  20 of 20 required (67 feasible)
06/23 10:24:05 MJobStart(42433)
06/23 10:24:05 MJobDistributeTasks(42433,SCFS.PITT.PENN.SEAGATE.COM,NodeList,TaskMap)
06/23 10:24:05 MAMAllocJReserve(42433,RIndex,ErrMsg)
06/23 10:24:05 MRMJobStart(42433,Msg,SC)
06/23 10:24:05 MPBSJobStart(42433,SCFS.PITT.PENN.SEAGATE.COM,Msg,SC)
06/23 10:24:05 MPBSJobModify(42433,Resource_List,Resource,sc45:ppn=4+sc44:ppn=4+sc43:ppn=4+sc35:ppn=4+sc32:ppn=4)
06/23 10:24:05 MPBSJobModify(42433,Resource_List,Resource,20:ib)
06/23 10:24:05 INFO:     job '42433' successfully started
06/23 10:24:05 MStatUpdateActiveJobUsage(42433)
06/23 10:24:05 MResJCreate(42433,MNodeList,00:00:00,ActiveJob,Res)
06/23 10:24:05 INFO:     starting job '42433'
06/23 10:24:05 INFO:     1 jobs started on iteration 1378


There was a single other job running initially that was using 40 slots on 10
nodes (out of 47). There were other processes running on the nodes outside
of torque/maui, but when counting by hand we found that there were more than
5 nodes with a load less than 4, so there should have been enough available
for the job to run.  Just before it ran I had loaded up a number of
single-process jobs to see if they would be scheduled; it scheduled and ran
all 10 of them without a problem, and then in the same iteration job 42433
ran.
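
The hand count described above can be written down as a short script. This is only a sketch of that back-of-the-envelope check, not a model of how Maui itself accounts for load: the node names and load values are made-up illustration data, nodes_under_load is a hypothetical helper, and the 4.0 threshold and 4 slots per node are taken from the MAXLOAD setting and the ppn=4 allocation shown elsewhere in this message.

```python
SLOTS_PER_NODE = 4   # ppn=4 per node, as in the allocation that finally ran
MAX_LOAD = 4.0       # per-node MAXLOAD from maui.cfg
TASKS_NEEDED = 20    # tasks requested by job 42433


def nodes_under_load(loads, busy_nodes, max_load=MAX_LOAD):
    """Return nodes not held by another job whose load is below max_load."""
    return sorted(
        node
        for node, load in loads.items()
        if node not in busy_nodes and load < max_load
    )


# Hypothetical load readings; real values would come from the nodes themselves.
loads = {
    "sc01": 0.5, "sc02": 4.2, "sc03": 1.0, "sc04": 3.9,
    "sc05": 0.0, "sc06": 2.5, "sc07": 6.0,
}
busy_nodes = {"sc07"}  # hypothetical node used by the other running job

candidates = nodes_under_load(loads, busy_nodes)
nodes_needed = -(-TASKS_NEEDED // SLOTS_PER_NODE)  # ceiling division
enough = len(candidates) >= nodes_needed
print(candidates, nodes_needed, enough)
```

A count like this only says that enough nodes were below the threshold at the moment of sampling; the scheduler may have seen different load values during its own iteration.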

From maui.cfg:

We have an entry as follows for each node though some have lower limits
because they run software outside the queue.
NODECFG[sc01]                    MAXLOAD=4.0

We also have the following in the file:
USERCFG[DEFAULT]                MAXJOB=150,200

NODEALLOCATIONPOLICY    CPULOAD
NODELOADPOLICY          ADJUSTSTATE
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST
QUEUETIMEWEIGHT       1


All the nodes are identical, each with two dual-core CPUs.

Any thoughts or suggestions are appreciated.

Thanks,
Rob