[torqueusers] PBS Error: Execution server rejected request
Garrick Staples
garrick at usc.edu
Mon Nov 7 14:42:29 MST 2005
On Mon, Nov 07, 2005 at 12:08:59PM +1100, Chris Samuel alleged:
> On Sat, 5 Nov 2005 07:16 pm, garrick wrote:
>
> > It's pretty much painless. Just install the new daemons and restart
> > them. Don't restart MOMs on hosts that have running jobs.
>
> Nice - don't suppose you've got a record of which versions you've upgraded
> between without hitting these issues ?
Unofficially, I suppose so. This directory has every torque rpm that
I've ever had in production:
http://mirrors.usc.edu/usc/usclinux/3AS/source/common/
As you can see, some are pretty short-lived tiny jumps. But there are
also some significant jumps like 1.0.1p6->1.1.0p4 and 1.1.0p4->1.2.0p1.
Before 1.0.1p6, I was using OpenPBS.
> Could be *really* handy for me as the earliest time our cluster will come
> completely free if the running jobs hit their walltimes will be the 20th
> January 2006.. :-(
I probably shouldn't, but I carefully mix and match different torque
versions all the time. I've never had a problem.
> Also, I wonder if instead of
>
> > - restart MOMs on all idle nodes
> > - wait a minute, make sure node and job states are updating correctly
> > - mark busy nodes offline
>
> It might be safer to do something like:
>
> - mark all nodes offline
> - restart MOMs on idle nodes
> - clear offline attribute on idle nodes
>
> Any thoughts ?
Doesn't matter since the first step is killing the scheduler.
Also, historically, OpenPBS/TORQUE doesn't get job updates from offline
MS nodes. That was fixed only relatively recently.
I think as long as you don't restart MOMs on active nodes, and kill the
scheduler while futzing around, it should be pretty safe no matter what
you do.
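
To make "don't restart MOMs on active nodes" concrete, here is a rough
sketch of picking out the idle nodes from saved `pbsnodes -a` output.
It only parses text and does not run or restart anything. The record
layout it expects (an unindented node name followed by indented
"key = value" lines, with a "jobs" line only on nodes running
something) is an assumption about the output format, and the node
names, job IDs, and both function names are made up for illustration.
Check it against your own version's output before trusting it.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a`-style text into {node_name: {attribute: value}}.

    Assumed layout: an unindented node name, then indented "key = value" lines.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None              # a blank line ends a node record
        elif not line[0].isspace():
            current = line.strip()      # an unindented line starts a new node
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")   # split on the first "=" only
            nodes[current][key.strip()] = value.strip()
    return nodes


def split_idle_busy(nodes):
    """Return (idle, busy) lists of node names.

    busy: the node reports at least one job.
    idle: no jobs and state is exactly "free".
    Anything else (down, offline, ...) is left out of both lists.
    """
    idle, busy = [], []
    for name, attrs in sorted(nodes.items()):
        if attrs.get("jobs"):
            busy.append(name)
        elif attrs.get("state") == "free":
            idle.append(name)
    return idle, busy


# Hypothetical sample; real output has more attributes per node.
SAMPLE = """\
node01
     state = free
     np = 2

node02
     state = job-exclusive
     np = 2
     jobs = 0/101.server, 1/101.server

node03
     state = free
     np = 2
     jobs = 0/102.server

node04
     state = down
     np = 2
"""

if __name__ == "__main__":
    idle, busy = split_idle_busy(parse_pbsnodes(SAMPLE))
    print("safe to restart MOM on:", idle)
    print("leave alone:", busy)
```

With the scheduler stopped, the MOMs on the "idle" list are the ones to
restart; the "busy" list is left alone until its jobs finish. Note that
node03 is treated as busy even though its state is "free", because it
still has a job on one of its processors.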
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California