[torqueusers] Special job for reboot

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Thu Jan 28 08:17:46 MST 2010


On Thu, Jan 28, 2010 at 9:05 AM, Arnau Bria <arnaubria at pic.es> wrote:
> Hi all,

hi arnau,

> this issue is a little OT, but I'd like to know other admin experiences.
>
> Someone already asked this some time ago:
>
> http://www.supercluster.org/pipermail/torqueusers/2008-December/008373.html
>
> But I can't find the solution he implemented, or whether it worked.

i don't quite see this as a solution. it is the kind of thing that people
have done to Windows NT servers. i have lots of cluster nodes that
get heavily pounded, and i can keep them running, often with uptimes of
a year or more, unless i am forced to reboot by some external cause.
if machines need to be rebooted that often, i would try to find out why
and then eliminate that specific cause.
for example, we have nodes with (very) old myrinet hardware, where
we get the occasional crash in the firmware (due to overload). this
can be detected by checking the output of "dmesg", so i have configured
a node health check script that runs regularly and at the beginning and
end of every job, and that sets the node offline if it finds a known
error condition (or not enough swap etc.).
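
a minimal sketch of such a health check, in python. this is not the actual
script described above: the error patterns, the swap threshold and the
function names are hypothetical placeholders, and it assumes that the node
is allowed to run "pbsnodes -o -N <note> <host>" against the server and that
the torque node name equals the local hostname.

```python
import re
import socket
import subprocess

# Hypothetical patterns; replace with the messages your hardware actually logs.
ERROR_PATTERNS = [r"firmware.*(crash|hang)", r"I/O error"]
MIN_SWAP_FREE_KB = 1024 * 1024  # hypothetical threshold: 1 GiB


def swap_free_kb(meminfo_text):
    """Return SwapFree in kB parsed from /proc/meminfo content, or None."""
    match = re.search(r"^SwapFree:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def find_problem(dmesg_text, meminfo_text,
                 patterns=ERROR_PATTERNS, min_swap_kb=MIN_SWAP_FREE_KB):
    """Return a short reason string for the first problem found, else None."""
    for pattern in patterns:
        for line in dmesg_text.splitlines():
            if re.search(pattern, line, re.IGNORECASE):
                return "dmesg: " + line.strip()
    free = swap_free_kb(meminfo_text)
    if free is not None and free < min_swap_kb:
        return "low swap: %d kB free" % free
    return None


def check_node():
    """Collect local data and set this node offline if a problem is found."""
    dmesg_text = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    with open("/proc/meminfo") as handle:
        meminfo_text = handle.read()
    reason = find_problem(dmesg_text, meminfo_text)
    if reason:
        # Assumption: the torque node name is the local hostname.
        host = socket.gethostname()
        # The reason is stored as the node note so the evidence is kept.
        subprocess.run(["pbsnodes", "-o", "-N", reason, host], check=False)
    return reason
```

check_node() would be called from cron and from the job prologue and
epilogue. the detection logic lives in find_problem(), which takes plain
text, so it can be tested without touching a real node.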

all that is needed then is to check for nodes that are offline, see why
they are offline (i have a script for that) and take appropriate measures.
i do that regularly or whenever there is time. => little overhead, it
doesn't automatically destroy evidence of problems, and it is transparent
for users.
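
a possible shape for the "why is it offline" report, as a sketch only. it
assumes the usual "pbsnodes -a" layout (node name at the start of a line,
indented "key = value" attributes below it, blank line between nodes) and
that the reason was stored in the node note. the function names are made up
for this example.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # blank line ends the current node block
        elif not line[0].isspace():
            current = nodes.setdefault(line.strip(), {})
        elif current is not None and "=" in line:
            # Split at the first "=" only; values may contain "=" themselves.
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return nodes


def offline_report(nodes):
    """Return sorted (node, note) pairs for nodes whose state includes 'offline'."""
    report = []
    for name, attrs in nodes.items():
        states = attrs.get("state", "").split(",")
        if "offline" in states:
            report.append((name, attrs.get("note", "(no note)")))
    return sorted(report)
```

on the server, the text to parse would come from something like
subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True).stdout.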

> I've seen a couple of good ideas like the one from Brock Palen
> recommending a job that requests a complete node and special host (#PBS
> -l host=$host,naccesspolicy=SINGLEJOB) and the other from Garrick:
>
> "First, you need to drain the nodes by marking them offline.  Then you
> need to mark them for reboot using the node note.  Then a script can
> reboot nodes when it finds them offline, without a job, and marked for
> reboot."
>
> But is anyone really doing reboots via torque? What are your steps when
> you need to reboot your farm?
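
the selection step in the procedure quoted from Garrick (offline, without a
job, marked for reboot) could look roughly like the sketch below. the marker
text "reboot" in the node note is a hypothetical convention, and treating a
missing or empty "jobs" attribute as "no job running" is an assumption about
the pbsnodes output. the actual reboot command is left out on purpose.

```python
REBOOT_MARK = "reboot"  # hypothetical marker text placed in the node note


def reboot_candidates(nodes, mark=REBOOT_MARK):
    """Return sorted names of nodes that are offline, idle and marked for reboot.

    `nodes` maps node name -> attribute dict as reported by pbsnodes,
    e.g. {"state": "offline", "note": "reboot"}.
    """
    ready = []
    for name, attrs in nodes.items():
        offline = "offline" in attrs.get("state", "").split(",")
        # Assumption: an idle node has no "jobs" attribute (or an empty one).
        idle = not attrs.get("jobs", "").strip()
        marked = mark in attrs.get("note", "")
        if offline and idle and marked:
            ready.append(name)
    return sorted(ready)
```

a node that is offline but still has a job, or that is offline for another
reason, is left alone until all three conditions hold.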

for scheduled reboots i used to stop queues, manually set nodes offline,
or both when i was using the torque fifo scheduler. with maui as scheduler,
i just put in a reservation ahead of time for when i want to do the reboot,
and the nodes will be empty when i need them to be.

cheers,
    axel.

> Any experience will be welcome!
>
> Cheers,
> Arnau



-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
Institute for Computational Molecular Science
College of Science and Technology
Temple University, Philadelphia PA, USA.

