[Mauiusers] Jobs in Queue Forever

Chris Samuel csamuel at vpac.org
Wed Nov 3 15:21:55 MST 2004


On Wed, 3 Nov 2004 02:44 am, C. D. Poon wrote:

> In the Maui log, I get the following lines.
>
> 11/02 10:11:02 ERROR:    job '503' cannot be started: (rc: 15044  
> errmsg: 'Resource temporarily unavailable'  hostlist:
> 'bc02-n07.isis.unc.edu')
> 11/02 10:11:02 ERROR:    job '504' cannot be started: (rc: 15044  
> errmsg: 'Resource temporarily unavailable'  hostlist:
> 'bc02-n07.isis.unc.edu')
>
> Has anyone seen this problem before?

I believe I have.

> Is it OpenPBS problem or Maui or a combination of both?

I suspect it's an OpenPBS bug that was recently fixed in Torque.

> Would it be a problem with configuration in either OpenPBS or Maui?

Nope, if my guess is right then the pbs_mom on 'bc02-n07.isis.unc.edu' has 
been restarted recently or there has been some other problem that has caused 
that node to reject jobs.

PBS remembers the node that Maui allocated for the job in the 'neednodes' 
attribute (or it may be that's how Maui tells PBS where to run the job; the 
effect is identical). Maui then reloads the job list, sees that the job has 
asked for that specific node, and will not try to reallocate it to another 
node.

If I'm right then if you do a qstat -f on the affected jobs you will see a 
'neednodes' value set to that hostname (or including that hostname if it's a 
multi-CPU job).
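
As a rough sketch of that check, assuming the qstat -f output carries a line 
of the form 'Resource_List.neednodes = VALUE' (the exact layout may differ 
between OpenPBS and Torque versions), a small shell helper can pull the value 
out. The function name neednodes_of is made up for illustration:

```shell
# Hypothetical helper: reads `qstat -f` output on stdin and prints the
# neednodes value, if there is one. Assumes a line of the form
# "Resource_List.neednodes = VALUE"; prints nothing if no such line exists.
neednodes_of() {
    sed -n 's/^[[:space:]]*Resource_List\.neednodes = //p'
}

# Usage on a live system (not run here):
#   qstat -f 503 | neednodes_of
```

If that prints the hostname of the troubled node, the job is probably pinned 
to it.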

It's fairly easy to work around: as administrator you reset neednodes to the 
value that the job was asking for initially, so if job 503 is a single CPU 
job you could do:

 qalter -l neednodes=1 503

and that should clear the information it has learnt.
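
If several jobs are stuck, the same fix could be scripted. A minimal sketch, 
assuming each error sits on a single line in the Maui log in the format 
quoted above (the lines are only wrapped in this mail), and that every 
affected job is a single CPU job. The function names stuck_jobs and 
print_fixes are made up for illustration, and the commands are only printed, 
not run:

```shell
# Hypothetical helper: reads Maui log lines on stdin and prints the IDs of
# jobs that failed to start with rc 15044, one per line, without duplicates.
stuck_jobs() {
    sed -n "s/.*ERROR:[[:space:]]*job '\([^']*\)' cannot be started: (rc: 15044 .*/\1/p" | sort -u
}

# Dry run: print one qalter command per stuck job instead of executing it.
# neednodes=1 is only right for single CPU jobs, so check each job first.
print_fixes() {
    stuck_jobs | while read -r id; do
        echo "qalter -l neednodes=1 $id"
    done
}

# Usage (the log path is a placeholder, not a known location):
#   print_fixes < /path/to/maui.log
```

Printing rather than running the commands leaves room to review the list and 
adjust neednodes for any multi-CPU jobs by hand.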

I *think* you may also need to restart Maui, although if you're lucky just 
doing:

 schedctl -s
 schedctl -r

to pause it and then resume it, so that it re-reads the queue immediately, 
may be enough (you may even get away with just a 'schedctl -r').

Of course I reserve the right to be completely wrong on this and it could be 
something else altogether... :-)

cheers,
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
