[torqueusers] Pulling failed nodes based on the launched process' exit code

Peter Wyckoff wyckoff at yahoo-inc.com
Wed Sep 5 14:21:10 MDT 2007


Hi,

I am trying to make my installation as robust as possible against poorly
installed software and/or hardware problems.

I've done this by:

1. Always selecting a random node as the head node for a job, since I don't
want one bad node to cause a job to fail repeatedly.

2. Running the health script on each node.

I would also like to:

3. Based on the exit code of the process on a node, potentially mark the
box offline, or attach something like -n "suspicious" to it. In an advanced
world, this would happen only if the last X jobs failed on this node, or
X out of the last Y.
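
The "X out of Y" rule in item 3 can be sketched as a small sliding window over
recent exit codes. This is only an illustration in Python, not anything torque
provides: the class name NodeFailureTracker and its methods are hypothetical,
and it assumes a non-zero exit code means the job failed on that node.

```python
from collections import deque


class NodeFailureTracker:
    """Tracks recent job exit codes for one node (hypothetical helper)."""

    def __init__(self, max_failures, window):
        if not 0 < max_failures <= window:
            raise ValueError("need 0 < max_failures <= window")
        self.max_failures = max_failures      # X: failures that trigger suspicion
        self.history = deque(maxlen=window)   # last Y results; oldest is dropped

    def record(self, exit_code):
        """Store one result; a non-zero exit code counts as a failure."""
        self.history.append(exit_code != 0)

    def is_suspicious(self):
        """True once at least X of the last Y recorded jobs failed."""
        return sum(self.history) >= self.max_failures


# Example: flag the node when 3 of its last 5 jobs failed.
tracker = NodeFailureTracker(max_failures=3, window=5)
for code in [0, 1, 1, 0, 1]:
    tracker.record(code)
print(tracker.is_suspicious())
```

Setting max_failures equal to window gives the stricter "last X jobs all
failed" variant.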

Is there any capability in torque to do #3?  I could probably do it with a
wrapper around the process running on each node. Something like:

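#!/usr/local/bin/perl
use strict;
use warnings;

# @ARGV is zero-based: the first argument is the executable,
# everything after it is passed through as its arguments.
my ($executable, @args) = @ARGV;

# Run the real process and wait for it; $? then holds its exit status.
system($executable, @args);
if ($? != 0) {

 # mark the node as suspicious or pull it out or ...

}

exit 0;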


But this assumes the user I run it as has permission to mark nodes offline,
which I don't especially like.
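
One possible way to keep that permission away from the job user, sketched in
Python under my own assumptions: the wrapper only appends each exit code to a
per-node log file, and some separate privileged process (not shown) would read
the log and decide whether to pull the node. The function name run_and_record
and the "timestamp exit_code" log format are hypothetical.

```python
import os
import subprocess
import sys
import tempfile
import time


def run_and_record(command, log_path):
    """Run command, append its exit code to log_path, and return the code."""
    result = subprocess.run(command)
    with open(log_path, "a") as log:
        # One line per job: Unix timestamp, then the exit code.
        log.write(f"{int(time.time())} {result.returncode}\n")
    return result.returncode


# Demo: run a child process that exits with status 3 and record it.
with tempfile.TemporaryDirectory() as spool:
    demo_log = os.path.join(spool, "exit_codes.log")
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    demo_code = run_and_record(failing, demo_log)
    with open(demo_log) as log:
        demo_lines = log.read().splitlines()
```

The wrapper itself then needs nothing beyond write access to the log file.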

Thanks, pete


