[torqueusers] Job eligible, nodes free, but job would not start

Garrick Staples garrick at clusterresources.com
Fri Oct 13 12:48:30 MDT 2006


On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
> Hi All,
> 
> I am using torque-2.0.0p2 and maui-3.2.6p13, and notice the following 
> behavior today:

Obviously I recommend an upgrade to 2.1.3, but it probably wouldn't fix
anything related to this problem.


> - There are several jobs in the queue that are in the Q state. When I do 
> checkjob <jobid>, I get (among other things):
> "job can run in partition DEFAULT (63 procs available.  1 procs required)"
> but the job remains in Q forever. It is not the case of a resource 
> requirement not being met (as the above message indicates)

That means a reservation is set preventing the jobs from running.

 
> - nothing untoward in the torque logs
> 
> - I see several of these messages in maui.log:
> MSysRegEvent(JOBCORRUPTION:  job 'jobid' has the following idle node(s) 
> allocated: 'node114' ,0,0,1)
> but these are for the running jobs, not the Q'ed jobs in question

"busy" in TORQUE means the node's load average has exceeded $max_load
(which only applies if you've configured $ideal_load/$max_load), not
that a job is assigned.

"idle" in Maui means that TORQUE is reporting the node as "free", which
reflects load average, not whether jobs are assigned.

Maui does expect the nodes to be loaded according to the number of CPUs
assigned, but "CORRUPTION" is probably too strong a word.
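
For reference, a minimal sketch of where those load thresholds are set,
assuming a 4-CPU node and the usual pbs_mom config file (mom_priv/config
under the TORQUE home directory). The numbers are illustrative only, not
recommendations:

```text
# mom_priv/config on the compute node (values are hypothetical)
# The node goes back to "free" once its load drops below this:
$ideal_load 3.5
# The node is reported as "busy" once its load exceeds this:
$max_load 4.5
```

pbs_mom has to re-read its config (e.g. by restarting it) before changed
values take effect.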


> - I also see messages like these in the maui.log:
> INFO:     PBS node node114 set to state Idle (free)
> INFO:     node 'node114' changed states from Running to Idle
> although, this node has 2 out of 4 procs busy
> this message is repeated for several nodes.

Again, this is more related to load average than assigned jobs.
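
If you want to see which nodes keep triggering these messages, a rough
sketch like the following can tally them from maui.log. The regular
expressions are assumptions modelled only on the lines quoted in this
thread (other Maui versions may format them differently), and
summarize() is a hypothetical helper name, not part of Maui or TORQUE:

```python
import re
from collections import Counter

# Patterns modelled on the maui.log lines quoted in this thread;
# treat them as assumptions, not a documented log format.
CORRUPTION_RE = re.compile(
    r"JOBCORRUPTION:\s+job '([^']+)' has the following idle node\(s\)"
    r"\s+allocated:\s+'([^']+)'"
)
STATE_RE = re.compile(r"node '([^']+)' changed states from (\w+) to (\w+)")


def summarize(lines):
    """Count, per node, JOBCORRUPTION events and Running -> Idle changes."""
    corruption = Counter()
    flips = Counter()
    for line in lines:
        match = CORRUPTION_RE.search(line)
        if match:
            corruption[match.group(2)] += 1
            continue
        match = STATE_RE.search(line)
        if match and (match.group(2), match.group(3)) == ("Running", "Idle"):
            flips[match.group(1)] += 1
    return corruption, flips


# Sample input copied from the messages quoted above (unwrapped).
sample = [
    "MSysRegEvent(JOBCORRUPTION:  job 'jobid' has the following idle "
    "node(s) allocated: 'node114' ,0,0,1)",
    "INFO:     PBS node node114 set to state Idle (free)",
    "INFO:     node 'node114' changed states from Running to Idle",
]
corruption, flips = summarize(sample)
for node in sorted(set(corruption) | set(flips)):
    print(node, "corruption:", corruption[node], "running->idle:", flips[node])
```

To run it against a real log, pass the open file object (or any iterable
of lines) to summarize() instead of the sample list. Nodes that show up
in both counts are the ones where the load-based state and the job
allocation disagree.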

 
> - restarting torque and maui did not help either

Look at the reservations preventing the job from running; in Maui,
showres and diagnose -r will list them.

 
> - if I say qrun <jobid> for the stuck jobs, I get:
> qrun: Resource temporarily unavailable <jobid>
> 
> - but if I do runjob <jobid>, the jobs are started !!
> 
> I am unable to correlate all this information. Does anyone know what can 
> be going wrong, or where else can I hunt for things?
> 
> Thanks.
> 
> -Neel
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

