[torqueusers] strategies for bad nodes

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Tue Apr 17 07:25:16 MDT 2012


hello william,

On Tue, Apr 17, 2012 at 9:02 AM, Edsall, William (WJ) <WJEdsall at dow.com> wrote:
> Hello list,
>
> I’m looking for ideas on how to prevent jobs from going to ‘bad’ nodes.
> There are a small handful of items which define a bad node for us such as
> ypbind not bound, maybe /scr is full, etc. We need to be able to customize
> this list.
>
> What might be built into torque to achieve this? It would be ideal if the
> node was not only passed by for a job but even offlined with a comment.

yes. you can do this via a node check script.

http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/11.2healthcheck.php

we use it to determine known problematic conditions
or pre-failure warnings and have the node go offline.
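
a minimal sketch of such a check script follows, covering the two
conditions mentioned in the question (ypbind not bound, /scr full).
this is not our production script: the function names, the 5%
free-space threshold and the use of `ypwhich` to test the NIS binding
are all assumptions made for illustration. as far as i know, pbs_mom
runs the script named by `$node_check_script` in mom_priv/config (at
the interval set with `$node_check_interval`) and treats output
starting with ERROR as a failed check. how the node is then taken
offline (for example by the scheduler, or by an explicit `pbsnodes -o`
call) is site-specific and not shown here.

```python
#!/usr/bin/env python3
"""Sketch of a TORQUE node health check (illustrative, not from the original post)."""
import shutil
import subprocess


def check_ypbind():
    """Return a problem string if NIS looks unbound, else None.

    Assumption: a non-zero exit status from `ypwhich` means ypbind is not bound.
    """
    try:
        result = subprocess.run(
            ["ypwhich"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "ypbind not bound (ypwhich could not be run)"
    if result.returncode != 0:
        return "ypbind not bound"
    return None


def check_scratch(path="/scr", min_free_fraction=0.05):
    """Return a problem string if `path` is missing or nearly full, else None.

    The 5% default threshold is a hypothetical value; pick one that suits the site.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return f"cannot check {path}: {exc.strerror}"
    if usage.total == 0 or usage.free / usage.total < min_free_fraction:
        return f"{path} is full"
    return None


def format_report(problems):
    """Join all detected problems into one line that starts with ERROR.

    Returns an empty string when every check passed, so a healthy node
    produces no output at all.
    """
    found = [p for p in problems if p]
    if not found:
        return ""
    return "ERROR " + "; ".join(found)


def main():
    # Extend this list to customize what counts as a "bad" node.
    report = format_report([check_ypbind(), check_scratch()])
    if report:
        print(report)


if __name__ == "__main__":
    main()
```

new conditions can be added by writing another check function that
returns either None or a short message and appending it to the list
in main().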

cheers,
    axel.

> Thanks,
>
> William



-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

