[Mauiusers] Maui/Torque stopped running jobs

Kevin Hildebrand kevin at umd.edu
Fri Oct 26 19:41:41 MDT 2007


On Sat, 27 Oct 2007, Jan Ploski wrote:

> Kevin Hildebrand wrote:
>> 
>> Hello, at some point today my Maui/Torque installation stopped running 
>> jobs.  Maui appears able to select an available set of nodes, but then 
>> cannot start the job.  I'm not getting any errors on the Torque side; in 
>> fact, I'm not seeing any Torque log entries showing that the job is even 
>> being started.  Here's what I'm seeing in the Maui logs:
>> 
>> 10/26 16:43:25 INFO:     tasks located for job 21542:  2 of 2 required (36 feasible)
>> 10/26 16:43:25 INFO:     allocated MNode[000]x2 'compute-2-1.deepthought.umd.edu' to 21542:0
>> 10/26 16:43:25 MJobStart(21542)
>> 10/26 16:43:25 MJobDistributeTasks(21542,DEEPTHOUGHT.UMD.EDU,NodeList,TaskMap)
>> 10/26 16:43:25 INFO:     1 node(s)/2 task(s) added to 21542:0
>> 10/26 16:43:25 INFO:     MNode[000] 'compute-2-1.deepthought.umd.edu'(x2) added to job '21542'
>> [020] compute-2-1.deepthought.umd.edu: (P:4,S:5405,M:3946,D:1) [Idle][linux][[NONE]]<0.020000> C:[debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4][DEFAULT] [noib][prod][dell1950] [debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4]
>> 10/26 16:43:25 INFO:     end of list reached.  1 nodes found
>> 10/26 16:43:25 INFO:     tasks distributed: 2 (Round-Robin)
>> 10/26 16:43:25 MAMAllocJReserve(21542,RIndex,ErrMsg)
>> 10/26 16:43:25 MRMJobStart(21542,Msg,SC)
>> 10/26 16:43:25 INFO:     cannot start job 21542 (cannot start job - fail iteration)
>> 10/26 16:43:25 WARNING:  cannot start job '21542' through resource manager
>> 10/26 16:43:25 ERROR:    MBFFirstFit:  cannot start job 21542.0
>> 
>> Does anybody have a clue as to what's going on?  (I've tried restarting 
>> both Torque and Maui, and the problem persists.)
>
> What does checkjob -v 21542 tell you?
>
> Regards,
> Jan Ploski
>

Well, I walked away from it for a few hours and came back, and all of the 
stuck jobs are now running.  checkjob wasn't showing anything interesting: 
it reported "job can run in partition DEFAULT" and listed no "Rejection 
Reasons".  Numerous nodes that met the job selection criteria were 
available.
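
For anyone who runs into the same symptom, here is a minimal sketch of a 
script that counts how often each job fails to start, based only on the 
WARNING line format in the log excerpt above.  The function name and the 
sample list are made up for illustration, and the pattern is an assumption 
drawn from that one excerpt; other Maui versions or log levels may word the 
message differently.

```python
import re
from collections import Counter

# Pattern modeled on the single WARNING line in the excerpt above
# (assumption: other Maui builds may format this message differently).
START_FAILURE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) WARNING:\s+"
    r"cannot start job '(?P<job>[^']+)' through resource manager"
)


def count_start_failures(lines):
    """Return a Counter mapping job id -> number of failed start attempts."""
    failures = Counter()
    for line in lines:
        # Drop leading mail-quote markers ("> ") in case the log was
        # pasted from an email, plus any trailing whitespace.
        match = START_FAILURE.match(line.lstrip("> ").rstrip())
        if match:
            failures[match.group("job")] += 1
    return failures


# Hypothetical sample: three lines copied from the excerpt above.
sample = [
    "10/26 16:43:25 MRMJobStart(21542,Msg,SC)",
    "10/26 16:43:25 WARNING:  cannot start job '21542' through resource manager",
    "10/26 16:43:25 ERROR:    MBFFirstFit:  cannot start job 21542.0",
]

for job, count in count_start_failures(sample).items():
    print(f"job {job}: {count} failed start attempt(s)")
```

Feeding it the lines of the actual Maui log (the location depends on the 
installation) should show whether the failures are limited to a few jobs or 
affect everything the scheduler tries to start.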

Kevin

