Kevin Hildebrand kevin at umd.edu
Fri Oct 26 14:50:37 MDT 2007

Hello, at some point today my Maui/Torque installation stopped running 
jobs.  Maui appears able to select an available set of nodes, but then 
fails to start the job.  I'm not getting any errors on the Torque side; 
in fact, I'm not seeing any Torque log entries showing that the job is 
being started at all.  Here's what I'm seeing in the Maui logs:

10/26 16:43:25 INFO:     tasks located for job 21542:  2 of 2 required (36 
10/26 16:43:25 INFO:     allocated MNode[000]x2 'compute-2-1.deepthought.umd.edu' to 21542:0
10/26 16:43:25 MJobStart(21542)
10/26 16:43:25 
10/26 16:43:25 INFO:     1 node(s)/2 task(s) added to 21542:0
10/26 16:43:25 INFO:     MNode[000] 'compute-2-1.deepthought.umd.edu'(x2) added to job '21542'
[020] compute-2-1.deepthought.umd.edu: (P:4,S:5405,M:3946,D:1) [Idle][linux][[NONE]]<0.020000> C:[debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4][DEFAULT] [noib][prod][dell1950] [debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4]
10/26 16:43:25 INFO:     end of list reached.  1 nodes found
10/26 16:43:25 INFO:     tasks distributed: 2 (Round-Robin)
10/26 16:43:25 MAMAllocJReserve(21542,RIndex,ErrMsg)
10/26 16:43:25 MRMJobStart(21542,Msg,SC)
10/26 16:43:25 INFO:     cannot start job 21542 (cannot start job - fail 
10/26 16:43:25 WARNING:  cannot start job '21542' through resource manager
10/26 16:43:25 ERROR:    MBFFirstFit:  cannot start job 21542.0

Does anybody have a clue as to what's going on?  (I've tried restarting 
both Torque and Maui, and the problem persists.)
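
In case it helps anyone looking at a longer log, here is a rough Python 
sketch for pulling the WARNING/ERROR entries that mention a given job. 
It assumes each entry starts with an MM/DD HH:MM:SS timestamp followed 
by a level, as in the excerpt above; the function name failed_starts and 
the sample list are only illustrative, not part of Maui or Torque.

```python
import re

# Assumed entry shape, inferred only from the log excerpt in this post:
#   "MM/DD HH:MM:SS LEVEL:  message"
ENTRY = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (INFO|WARNING|ERROR):\s+(.*)$"
)


def failed_starts(lines, job_id):
    """Return (timestamp, level, message) for WARNING/ERROR entries
    whose message mentions job_id (hypothetical helper)."""
    job_pattern = re.compile(rf"\b{re.escape(job_id)}\b")
    hits = []
    for line in lines:
        match = ENTRY.match(line.rstrip())
        if not match:
            # Function-trace lines such as "MRMJobStart(...)" carry no level.
            continue
        stamp, level, message = match.groups()
        if level in ("WARNING", "ERROR") and job_pattern.search(message):
            hits.append((stamp, level, message))
    return hits


# Sample lines copied from the excerpt above (truncation left as-is).
sample = [
    "10/26 16:43:25 MRMJobStart(21542,Msg,SC)",
    "10/26 16:43:25 INFO:     cannot start job 21542 (cannot start job - fail ",
    "10/26 16:43:25 WARNING:  cannot start job '21542' through resource manager",
    "10/26 16:43:25 ERROR:    MBFFirstFit:  cannot start job 21542.0",
]

for hit in failed_starts(sample, "21542"):
    print(hit)
```

To run it against a real log, pass an open file object instead of the 
sample list; the path to the Maui log depends on the installation.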


Kevin Hildebrand
University of Maryland, College Park

