[torqueusers] Special job for reboot

Arnau Bria arnaubria at pic.es
Mon Feb 1 08:33:51 MST 2010


On Mon, 01 Feb 2010 10:11:05 -0500
Axel Kohlmeyer wrote:

> On Mon, 2010-02-01 at 15:39 +0100, Arnau Bria wrote:
> 
> hi arnau,
Hi Axel,

First of all, sorry for moving this issue off the list. I replied from
the Gmail web interface and I'm not used to it.
  
> > I'm basically referring to kernel updates. We've had a couple of
> > updates in the last 3 months. Unfortunately I can't prevent those
> > kinds of things, and a reboot is needed.
> 
> why can't you prevent the kernel updates? on clusters the "never
> change a winning team" type of philosophy is of paramount importance,
> if you care about the amount of time extracted and invested.
> are those updates _really_ needed? if yes, then all machines should be
> updated immediately, if not, then it can be postponed.

When we receive a critical update announcement, we have 7 days to
upgrade the kernels. So we have some time to do it, yes.
I could make a quick plan, drain nodes, etc., but we'd like it to be as
automatic as possible, with the simplest possible action on our side.

A simple script would be really nice, and it seems possible to me.

> i would not want to have a parallel machine with inconsistent kernels.

Me neither. But here I have 2 options: run a mixed cluster for some
hours, or plan a downtime. The first option would save a lot of compute
time; the second would not.
Right now I have to drain nodes in blocks, check when they are empty,
and then reboot, so I'm losing computation time. And if I plan a
downtime, some nodes will be empty before others, so those nodes will
sit idle for some time when they could be running jobs.

If a script can do it for me, I can gain a lot of compute time. And I'm
basically talking about kernel updates, which need reboots.
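
To put rough numbers on that idle time, here is a small Python sketch.
The job end times are made-up values and `idle_node_hours` is a
hypothetical helper, not something from TORQUE. It assumes all nodes
stop accepting new jobs at hour 0 and that a common downtime can only
start once the last node is empty.

```python
def idle_node_hours(last_job_end_hours, reboot_at=None):
    """Node-hours spent empty while waiting for a common reboot time.

    last_job_end_hours: hour at which each node's last running job ends.
    reboot_at: common reboot hour; defaults to when the last node empties.
    """
    latest_end = max(last_job_end_hours)
    if reboot_at is None:
        reboot_at = latest_end
    if reboot_at < latest_end:
        raise ValueError("reboot scheduled before all nodes are empty")
    # Each node waits from the end of its own last job until the reboot.
    return sum(reboot_at - end for end in last_job_end_hours)


# Hypothetical example: four nodes whose last jobs end after 2, 5, 9
# and 12 hours. One common downtime makes the early nodes wait.
ends = [2, 5, 9, 12]
print(idle_node_hours(ends))
```

If each node were instead rebooted as soon as it became empty, that
waiting time would drop to roughly the reboot time itself, which is the
gain I'm after.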


[...]

> > So, for hw issues, I can control by hand, but for a kernel update,
> > it could be nice if some kind of script could do the job for me.
> > It's a simple "status" change of
> > online->offline->drain->reboot->online.
> 
> well, in principle you can write a small script that parses pbsnodes
> -a and then launches whatever is needed, 
Yep, this was an idea that I mentioned in the OP, but I much prefer the
job that requests an entire node. I find it the simplest: only a sudo
configuration is needed.
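
For reference, the pbsnodes-parsing approach could be sketched roughly
as below, in Python. This is a minimal sketch under assumptions, not a
tested tool: it assumes the usual `pbsnodes -a` text layout (node name
at the start of a line, indented `key = value` attributes, a blank line
between nodes), and the names `parse_pbsnodes`,
`nodes_ready_for_reboot` and `query_pbsnodes` are hypothetical. It only
reports nodes that are marked offline and have no jobs left; the reboot
itself and bringing the node back online are left out.

```python
import subprocess


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None          # a blank line ends the current node record
        elif not line[0].isspace():
            current = line.strip()  # unindented line: a new node name
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Split on the first "=" only; values such as `status` contain more.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def nodes_ready_for_reboot(nodes):
    """Return sorted names of nodes marked offline that have no jobs left."""
    ready = []
    for name, attrs in nodes.items():
        states = attrs.get("state", "").split(",")
        if "offline" in states and not attrs.get("jobs"):
            ready.append(name)
    return sorted(ready)


def query_pbsnodes():
    """Run `pbsnodes -a` and return its output (needs the TORQUE client tools)."""
    result = subprocess.run(["pbsnodes", "-a"], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

On a host with the TORQUE client tools,
`nodes_ready_for_reboot(parse_pbsnodes(query_pbsnodes()))` would give
the list of drained nodes. Nodes would first be marked with
`pbsnodes -o <node>` and cleared again with `pbsnodes -c <node>` after
the reboot.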

> but i am always doubtful of
> such kind of automatisms. you always have to consider the usability
> and the risk of screwing up something that will take you _much_ more
> time to fix. there is such a thing as an overengineered cluster. i
> don't know how much time i have saved by being less eager to fill
> machines, and smarter how and when to schedule maintenance.   

[...]

> > Maybe reservations could be a solution. I'll take a look and see how
> > much computation time I lose.
> 
> if you cannot afford to spend the time, then you should not run kernel
> upgrades. most of the time, they are not needed at all. sometimes they
> even create problems. i only maintain the login nodes of our clusters
> that carefully, but those have very little to do.
> 
> "if it ain't broken, don't fix it." ;)

I don't like changing things, but if I don't upgrade within 7 days, my
entire site (cluster) could be considered offline, and no jobs will be
scheduled. So I have to.
It's not in my hands to postpone the update or to debate the importance
of kernel upgrades.
 
> cheers,
>     axel.
Cheers,
Arnau

