[torqueusers] Special job for reboot
arnaubria at pic.es
Mon Feb 1 08:33:51 MST 2010
On Mon, 01 Feb 2010 10:11:05 -0500
Axel Kohlmeyer wrote:
> On Mon, 2010-02-01 at 15:39 +0100, Arnau Bria wrote:
> hi arnau,
First of all, sorry for moving this issue off the list; I replied from
the Gmail web interface and I'm not used to it.
> > I'm basically referring to kernel updates. We've had a couple of
> > updates in the last 3 months. Unfortunately I can't prevent those kinds
> > of things, and a reboot is needed.
> why can't you prevent the kernel updates? on clusters the "never
> change a winning team" type of philosophy is of paramount importance,
> if you care about the amount of time extracted and invested.
> are those updates _really_ needed? if yes, then all machines should be
> updated immediately, if not, then it can be postponed.
When we receive a critical update announcement, we have 7 days to
upgrade the kernels. So we have some time to do it, yes.
I could make a quick plan, drain nodes, etc., but we'd like it to be as
automatic as we can, at least for the simplest actions.
A simple script would be really nice, and it seems possible to me.
> i would not want to have a parallel machine with inconsistent kernels.
Me neither. But here I have 2 options: run a mixed cluster for some
hours, or plan a downtime. The first option would save a lot of compute
time. Right now I have to drain nodes in blocks and check when they are
empty, then reboot them. So I'm losing computation time here. And if I
plan a downtime, some nodes will be empty before others, so those nodes
will sit idle for some time when they could be running jobs.
If a script can do it for me, I can save a lot of compute time. And I'm
basically talking about kernel updates, which need reboots.
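
To make the idea concrete, here is a minimal sketch of the per-node
cycle (mark offline, wait until drained, reboot, bring back online) as
a single decision function. It is only an illustration under stated
assumptions: the function name, the kernel version strings and the
"ssh ... sudo reboot" command are hypothetical, and it assumes that only
this script ever marks nodes offline. `pbsnodes -o` and `pbsnodes -c`
are the usual TORQUE calls to set and clear the offline state.

```python
def next_step(node, states, has_jobs, running_kernel, target_kernel):
    """Return the next command (as an argv list) for one node, or None.

    None means "nothing to do right now": the node is still draining,
    is currently down (e.g. rebooting), or is already updated and online.
    """
    offline = "offline" in states
    if "down" in states:
        # Rebooting or broken: leave it alone until it reports back.
        return None
    if running_kernel == target_kernel:
        # Already on the new kernel: put it back into service if needed.
        return ["pbsnodes", "-c", node] if offline else None
    if not offline:
        # Old kernel and still accepting jobs: stop new jobs from starting.
        return ["pbsnodes", "-o", node]
    if has_jobs:
        # Offline but still draining: wait for running jobs to finish.
        return None
    # Offline and empty: safe to reboot.
    # Hypothetical command; the real reboot mechanism is site-specific.
    return ["ssh", node, "sudo", "reboot"]


# Example: an empty, offline node that still runs the old kernel.
print(next_step("node03", ["offline"], False, "old-kernel", "new-kernel"))
```

Run periodically (from cron, for instance) over all nodes, each node
would move through the cycle on its own as soon as it is empty, instead
of waiting for a whole block to drain. A real script would also need a
way to tell its own offline marks apart from ones set by hand for
hardware problems.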
> > So, for hw issues, I can control by hand, but for a kernel update,
> > it could be nice if some kind of script could do the job for me.
> > It's a simple "status" change of
> > online->offline->drain->reboot->online.
> well, in principle you can write a small script that parses pbsnodes
> -a and then launches whatever is needed,
Yep, this was an idea that I mentioned in the OP, but I much prefer the
job requesting an entire node. I find it simplest; only a sudo conf is
needed.
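
For reference, the "parse pbsnodes -a" approach could look roughly like
the sketch below. It assumes the usual text layout of `pbsnodes -a`
(node name at the start of a line, indented "key = value" attributes,
blank line between nodes). The sample text, host names and function
names are made up for illustration, not taken from a real cluster.

```python
import subprocess


def pbsnodes_output():
    """Run `pbsnodes -a` and return its text (needs a TORQUE client)."""
    result = subprocess.run(["pbsnodes", "-a"], capture_output=True,
                            text=True, check=True)
    return result.stdout


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None            # blank line ends a node record
            continue
        if not line[0].isspace():
            current = line.strip()    # unindented line is a node name
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def ready_for_reboot(nodes):
    """Return the sorted names of nodes that are offline and have no jobs."""
    ready = []
    for name, attrs in nodes.items():
        states = [s.strip() for s in attrs.get("state", "").split(",")]
        if "offline" in states and not attrs.get("jobs"):
            ready.append(name)
    return sorted(ready)


# Made-up sample in the assumed `pbsnodes -a` layout.
SAMPLE = """\
node01
     state = free
     np = 8
     jobs = 0/1234.server

node02
     state = offline
     np = 8
     jobs = 0/1235.server, 1/1236.server

node03
     state = offline
     np = 8
"""

print(ready_for_reboot(parse_pbsnodes(SAMPLE)))  # prints ['node03']
```

On a real cluster, parse_pbsnodes(pbsnodes_output()) would replace the
sample text, and the resulting list would feed whatever does the actual
reboot.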
> but i am always doubtful of
> such kind of automatisms. you always have to consider the usability
> and the risk of screwing up something that will take you _much_ more
> time to fix. there is such a thing as an overengineered cluster. i
> don't know how much time i have saved by being less eager to fill
> machines, and smarter how and when to schedule maintenance.
> > Maybe reservations could be a solution. I'll take a look and see how
> > much computation time I lose.
> if you cannot afford to spend the time, then you should not run kernel
> upgrades. most of the time, they are not needed at all. sometimes they
> even create problems. i only maintain the login nodes of our clusters
> that carefully, but those have very little to do.
> "if it ain't broken, don't fix it." ;)
I don't like changing things, but if I don't upgrade within 7 days, my
entire site (cluster) could be considered offline, and no jobs will be
scheduled. So I have to.
It's not in my hands to postpone the update or to debate the importance
of kernel upgrades.