[torqueusers] PBS Error: Execution server rejected request

Garrick Staples garrick at usc.edu
Mon Nov 7 14:42:29 MST 2005


On Mon, Nov 07, 2005 at 12:08:59PM +1100, Chris Samuel alleged:
> On Sat, 5 Nov 2005 07:16 pm, garrick wrote:
> 
> > It's pretty much painless.  Just install the new daemons and restart
> > them.  Don't restart MOMs on hosts that have running jobs.
> 
> Nice - don't suppose you've got a record of which versions you've upgraded 
> between without hitting these issues ?

Unofficially, I suppose so.  This directory has every torque rpm that
I've ever had in production:

http://mirrors.usc.edu/usc/usclinux/3AS/source/common/

As you can see, some are pretty short-lived tiny jumps.  But there's
also some significant jumps like 1.0.1p6->1.1.0p4 and 1.1.0p4->1.2.0p1.
Before 1.0.1p6, I was using OpenPBS.

 
> Could be *really* handy for me as the earliest time our cluster will come 
> completely free if the running jobs hit their walltimes will be the 20th 
> January 2006.. :-(

I probably shouldn't, but I carefully mix and match different torque
versions all the time.  I've never had a problem.

 
> Also, I wonder if instead of
> 
> >  - restart MOMs on all idle nodes
> >  - wait a minute, make sure node and job states are updating correctly
> >  - mark busy nodes offline
> 
> It might be safer to do something like:
> 
>  - mark all nodes offline
>  - restart MOM's on idle nodes
>  - clear offline attribute on idle nodes
> 
> Any thoughts ?

Doesn't matter since the first step is killing the scheduler.  

Also, historically, OpenPBS/TORQUE doesn't get job updates from offline
MS (mother superior) nodes.  That was fixed only relatively recently.

I think as long as you don't restart MOMs on active nodes, and kill the
scheduler while futzing around, it should be pretty safe no matter what
you do.
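
To pick out which nodes are safe to restart, here is a minimal sketch. It
assumes the usual `pbsnodes -a` layout: node name in the first column,
indented "key = value" lines below it, and a "jobs" attribute only on nodes
that have running jobs.  The function names and the sample text are
hypothetical, and the sketch only sorts nodes into two lists.  Stopping the
scheduler and restarting the MOMs is left to whatever your site uses.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a`-style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current node's record.
            current = None
            continue
        if not line[0].isspace():
            # An unindented line is a node name.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Indented "key = value"; split on the first "=" only.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def split_idle_busy(nodes):
    """Return (idle, busy) node names; busy means a non-empty jobs attribute.

    Assumption: the state field alone is not enough, since a multi-CPU node
    can report "free" while still running a job.
    """
    idle, busy = [], []
    for name in sorted(nodes):
        if nodes[name].get("jobs"):
            busy.append(name)
        else:
            idle.append(name)
    return idle, busy


# Hypothetical sample, shaped like `pbsnodes -a` output.
SAMPLE = """\
node01
     state = free
     np = 2

node02
     state = job-exclusive
     np = 2
     jobs = 0/123.server
"""

idle, busy = split_idle_busy(parse_pbsnodes(SAMPLE))
print("restart MOMs on:", idle)
print("leave alone (or mark offline):", busy)
```

Feed it the real `pbsnodes -a` output instead of SAMPLE, and only touch the
MOMs on the first list.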

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California