[Mauiusers] maui creates reservations for execution but doesn't start jobs

Anton Starikov ant.starikov at gmail.com
Thu May 14 21:15:54 MDT 2009


I'm using maui+torque combo for years by now, but today I've meet some  
miracle.
My current setup is

maui-3.2.6p21-1234905291
torque-2.4.0b1

And it used to work last month or so, but today stopped.

Maui creates reservations to start jobs but doesn't even try to really  
start them.
I will try to show latter what happens with job 7529.


# showres
------------------------------------------------------------------------------------
ReservationID       Type S       Start         End    Duration    N/ 
P    StartTime

7354                 Job R -2:12:24:57    11:35:03  3:00:00:00     
4/16   Tue May 12 16:24:08
7365                 Job R -2:02:15:39    21:44:21  3:00:00:00     
4/16   Wed May 13 02:33:26
7374                 Job R -1:07:41:39  1:16:18:21  3:00:00:00     
8/16   Wed May 13 21:07:26
7377                 Job R -1:02:19:23  1:21:40:37  3:00:00:00     
1/16   Thu May 14 02:29:42
7392                 Job R    -4:47:42     5:12:18    10:00:00     
1/16   Fri May 15 00:01:23
7393                 Job R    -4:27:50     5:32:10    10:00:00     
1/16   Fri May 15 00:21:15
7394                 Job R    -4:23:46     5:36:14    10:00:00     
1/16   Fri May 15 00:25:19
7397                 Job I     5:12:18    15:12:18    10:00:00     
1/16   Fri May 15 10:01:23
7529                 Job I    00:00:00  3:00:00:00  3:00:00:00     
1/16   Fri May 15 04:49:05
7561                 Job R    -4:18:30  2:19:41:30  3:00:00:00     
1/16   Fri May 15 00:30:35
7567                 Job R   -00:47:41     1:12:19     2:00:00     
1/1    Fri May 15 04:01:24
------------------------------------------------------------------------------------
As we can see, reservation is created, but job remain idle.

# showres -n
------------------------------------------------------------------------------------
node035        Job               7529       Idle   16    00:00:00   
3:00:00:00  Fri May 15 05:04:48
------------------------------------------------------------------------------------

I skipped rest of nodes here, but we see that reservation made for  
node035.


# checkjob 7529
------------------------------------------------------------------------------------
checking job 7529

State: Idle
Creds:  user:user  group:group  class:vasp  qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu May 14 23:35:15
  (Time Queued  Total: 5:18:18  Eligible: 5:16:08)

Total Tasks: 16

Req[0]  TaskCount: 16  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 1856M


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [D1][Q1][Q2][huge]
Flags:       RESTARTABLE

Reservation '7529' (00:00:00 -> 3:00:00:00  Duration: 3:00:00:00)
PE:  16.00  StartPriority:  396
job cannot run in partition D1 (insufficient idle procs available: 0 <  
16)

cannot select job 7529 for partition D2 (PartitionAccess)

job cannot run in partition Q1 (insufficient idle procs available: 0 <  
16)

job cannot run in partition Q2 (insufficient idle procs available: 0 <  
16)

cannot select job 7529 for partition check (PartitionAccess)

cannot select job 7529 for partition test (PartitionAccess)

job can run in partition huge (16 procs available.  16 procs required)
------------------------------------------------------------------------------------
So, here it is OK.

# diagnose -j 7529
------------------------------------------------------------------------------------
Name                  State Par Proc QOS     WCLimit R  Min      
User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem    
Disk  Procs      Class Features

7529                   Idle ALL   16 DEF  3:00:00:00 1   16 user  
group        -     5:17:58   [NONE] [NONE] [NONE]    >=0    >=0     
NC0    [vasp:1] [NONE]
------------------------------------------------------------------------------------

And parts of log related to this job (collected after maui restart):

------------------------------------------------------------------------------------
05/15 04:58:24 MJobFind('7529',J,0)
05/15 04:58:24 MJobCreate(7529,TRUE,JP)
05/15 04:58:24 MRMJobPreLoad(J,7529,0)
05/15 04:58:24 MPBSJobLoad(7529,7529.master,J,TaskList,0)
05/15 04:58:24 MReqCreate(7529,SrcRQ,DstRQ,DoCreate)
05/15 04:58:24 INFO:     adding requirement at slot 0
05/15 04:58:24 __MPBSGetTaskList(7529,16,NULL,0)
05/15 04:58:24 INFO:     processing node request line '16'
05/15 04:58:24 __MPBSGetTaskList(7529,16,NULL,0)
05/15 04:58:24 MJobSetCreds(7529,user,group,)
05/15 04:58:24 INFO:     default QOS for job 7529 set to DEFAULT(0)  
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
05/15 04:58:24 MRMJobPostLoad(7529,TaskList,MASTER)
05/15 04:58:24 MCPRestore(JOB,7529,Optr)
05/15 04:58:24 MJobLoadCP(JS,JOB                       7529 1242356159  
<job AWDuration="0" EEWDuration="19086" Hold="0"  
ReqSMinTime="00:00:00" StartCou
05/15 04:58:24 MUGetIndex(AWDuration,ValList,0)
05/15 04:58:24 MUGetIndex(EEWDuration,ValList,0)
05/15 04:58:24 MUGetIndex(Hold,ValList,0)
05/15 04:58:24 MUGetIndex(ReqSMinTime,ValList,0)
05/15 04:58:24 MUTimeFromString(00:00:00)
05/15 04:58:24 INFO:     string '00:00:00' -> 0
05/15 04:58:24 MUGetIndex(StartCount,ValList,0)
05/15 04:58:24 MUGetIndex(StatMSUtl,ValList,0)
05/15 04:58:24 MUGetIndex(StatPSDed,ValList,0)
05/15 04:58:24 MUGetIndex(StatPSUtl,ValList,0)
05/15 04:58:24 MUGetIndex(SuspendDuration,ValList,0)
05/15 04:58:24 MUGetIndex(SysPrio,ValList,0)
05/15 04:58:24 MUGetIndex(SysSMinTime,ValList,0)
05/15 04:58:24 MUTimeFromString(00:00:00)
05/15 04:58:24 INFO:     string '00:00:00' -> 0
05/15 04:58:24 INFO:     system min start time set on job  for  - 
INFINITY
05/15 04:58:24 INFO:     default QOS for job 7529 set to DEFAULT(0)  
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
05/15 04:58:24 MUNLFromTL(NL,TL)
05/15 04:58:24 INFO:     default QOS for job 7529 set to DEFAULT(0)  
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
05/15 04:58:24 INFO:     job '7529' loaded:  16 user group  
259200       Idle   0 1242336915   [NONE] [NONE] [NONE] >=      0  
 >=      0 [NONE] 12
------------------------------------------------------------------------------------

05/15 04:58:24 MJobCheckPolicies(7529,HARD,2,ALL,RIndex,NULL,2140000000)
05/15 04:58:24 INFO:     job '7529' added to queue at slot 73
------------------------------------------------------------------------------------

05/15 04:58:24 INFO:     job 7529 not considered for spanning

------------------------------------------------------------------------------------

05/15 04:58:24 MReqCheckResourceMatch(7529,0,node035,NULL)
05/15 04:58:24 INFO:     node node035 can provide resources for job  
7529:0
05/15 04:58:24 MJobCheckNRes(7529,node035,RQ[0],  INFINITY,TCAvail, 
1.000,RIndex,NULL,FeasCheck)
05/15 04:58:24 MReqCheckResourceMatch(7529,0,node035,RIndex)
05/15 04:58:24 INFO:     node node035 can provide resources for job  
7529:0
------------------------------------------------------------------------------------

05/15 04:58:24 INFO:     node node035 supports 16 tasks of job 7529:0  
for   INFINITY of 3:00:00:00 (no reservation)

------------------------------------------------------------------------------------

05/15 04:58:24 INFO:     96 feasible tasks found for job 7529:0 in  
partition huge (16 Needed)
05/15 04:58:24 INFO:     C[00]  S: 1242392150  E: 2139740800  T:  16   
N: 1
05/15 04:58:24 INFO:     C[00]  S: 1242375675  E: 1242392150  T:  16   
N: 1
05/15 04:58:24 INFO:     C[01]  S: 1242392150  E: 2139740800  T:  32   
N: 2
05/15 04:58:24 INFO:     node node035 supports 16 tasks of job 7529:0  
for   INFINITY of 3:00:00:00 (no reservation)
05/15 04:58:24 INFO:     C[00]  S: 1242356304  E: 1242375675  T:  16   
N: 1
05/15 04:58:24 INFO:     C[01]  S: 1242375675  E: 1242392150  T:  32   
N: 2
05/15 04:58:24 INFO:     C[02]  S: 1242392150  E: 2139740800  T:  48   
N: 3
05/15 04:58:24 INFO:     C[00]  S: 1242356304  E: 1242375675  T:  16   
N: 1
05/15 04:58:24 INFO:     C[01]  S: 1242375675  E: 1242375919  T:  32   
N: 2
05/15 04:58:24 INFO:     C[02]  S: 1242375919  E: 1242392150  T:  48   
N: 3
05/15 04:58:24 INFO:     C[03]  S: 1242392150  E: 2139740800  T:  64   
N: 4
05/15 04:58:24 INFO:     C[00]  S: 1242356304  E: 1242375675  T:  16   
N: 1
05/15 04:58:24 INFO:     C[01]  S: 1242375675  E: 1242375919  T:  32   
N: 2
05/15 04:58:24 INFO:     C[02]  S: 1242375919  E: 1242392150  T:  48   
N: 3
05/15 04:58:24 INFO:     C[03]  S: 1242392150  E: 1242599435  T:  64   
N: 4
05/15 04:58:24 INFO:     C[04]  S: 1242599435  E: 2139740800  T:  80   
N: 5
05/15 04:58:24 INFO:     C[00]  S: 1242356304  E: 1242375675  T:  16   
N: 1
05/15 04:58:24 INFO:     C[01]  S: 1242375675  E: 1242375919  T:  32   
N: 2
05/15 04:58:24 INFO:     C[02]  S: 1242375919  E: 1242392150  T:  48   
N: 3
05/15 04:58:24 INFO:     C[03]  S: 1242392150  E: 1242520182  T:  64   
N: 4
05/15 04:58:24 INFO:     C[04]  S: 1242520182  E: 1242599435  T:  80   
N: 5
05/15 04:58:24 INFO:     C[05]  S: 1242599435  E: 2139740800  T:  96   
N: 6
05/15 04:58:24 MJobSelectFRL(7529,G,1,RCount)
05/15 04:58:24 INFO:     start time 00:00:00 found for job 7529 in  
partition huge (1242356304)
05/15 04:58:24 MJobGetRange(7529,RQ,huge, 
00:00:00,GRange,MAvlNodeList,NodeMap,8,NULL)

------------------------------------------------------------------------------------

05/15 04:58:24 INFO:     node node035 supports 16 tasks of job 7529:0  
for   INFINITY of 3:00:00:00 (no reservation)
05/15 04:58:24 INFO:     located resources for 16 tasks (16) in best  
partition huge for job 7529 at time 00:00:00
05/15 04:58:24  
MJobAllocMNL(7529,MFeasibleList,NodeMap,MOutList,MINRESOURCE,1242356304)
05/15 04:58:24 INFO:     evaluating nodes on alloc iteration 0 for job  
7529:0
05/15 04:58:24 INFO:     evaluating nodes on alloc iteration 1 for job  
7529:0
05/15 04:58:24 INFO:     evaluating nodes on alloc iteration 2 for job  
7529:0
05/15 04:58:24 INFO:     evaluating nodes on alloc iteration 3 for job  
7529:0
05/15 04:58:24 INFO:     evaluating nodes on alloc iteration 4 for job  
7529:0
05/15 04:58:24 INFO:     evaluating nodes on alloc iteration 5 for job  
7529:0
05/15 04:58:24 INFO:     tasks located for job 7529:  16 of 16  
required (16 feasible)
05/15 04:58:24 INFO:     allocated MNode[000]x16 'node035' to 7529:0
05/15 04:58:24 MJobDistributeTasks(7529,MASTER,NodeList,TaskMap)
05/15 04:58:24 INFO:     1 node(s)/16 task(s) added to 7529:0
05/15 04:58:24 INFO:     MNode[000] 'node035'(x16) added to job '7529'
[031] node035: (P:16,S:64297,M:64619,D:1) [Idle][linux] 
[[NONE]]<0.000000> C:[normal 16:16][vasp 16:16][test 16:16][check  
16:16][batch 16:16][DEFAULT] [
05/15 04:58:24 INFO:     end of list reached.  1 nodes found
05/15 04:58:24 INFO:     tasks distributed: 16 (Round-Robin)
05/15 04:58:24 MResJCreate(7529,MNodeList,00:00:00,Priority,Res)
05/15 04:58:24 MResAddNode(7529,node035,16,0)
05/15 04:58:24 MResAdjustDRes(7529,FALSE)
05/15 04:58:24 INFO:     job '7529' reserved 16 tasks (partition huge)  
to start in 00:00:00 on Fri May 15 04:58:24
(WC: 259200)

------------------------------------------------------------------------------------

05/15 04:58:24 MJobPReserve(7529,huge,ResCount,ResCountRej)
05/15 04:58:24 INFO:     no priority reservations created (existing  
reservation)

------------------------------------------------------------------------------------

05/15 04:58:24 INFO:     job 7529 not considered for spanning
------------------------------------------------------------------------------------


I have to admit, that I don't have ideas what is going on. Yesterday  
everything it used to work. Today, jobs a blocking cluster. Only one  
way to force them to pass is to set backfill depth to 0.

Nothing happened between yesterday and today. No actions, but still  
changes, isn't it a miracle? :)




--
Anton Starikov.


More information about the mauiusers mailing list