[torqueusers] Pulling failed nodes based on the launched process' exit code
Peter Wyckoff
wyckoff at yahoo-inc.com
Wed Sep 5 14:21:10 MDT 2007
Hi,
I am trying to make my installation as robust as possible against poorly
installed software and/or hardware problems.
I've done this by:
1. always selecting a random node as the head node for a job since I don't
want one bad node to cause a job to repeatedly fail.
2. running the health script on each node
I would also like to:
3. based on the exit code of the process on a node, potentially mark the box
offline, or flag it with something like -n "suspicious". In an advanced
world, only do this if the last X jobs failed on this node, or X out of Y.
Is there any capability in torque to do #3? I could probably do it with a
wrapper around the process running on each node. Something like:
#!/usr/local/bin/perl
use strict;
use warnings;

# First argument is the executable; the rest are passed through to it.
my ($executable, @args) = @ARGV;
system($executable, @args);
if ($? != 0) {
    # mark the node as suspicious or pull it out or ...
}
exit 0;
But this assumes the user the wrapper runs as has permission to mark nodes
offline, which I don't especially like.
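To make the "X out of Y" idea in #3 concrete, here is a minimal sketch in Python of the bookkeeping I have in mind. It is only the decision logic, not anything TORQUE provides: `NodeFailureTracker`, `threshold` (X), and `window` (Y) are names I made up for illustration, and where the exit codes come from is left open.

```python
from collections import defaultdict, deque


class NodeFailureTracker:
    """Flag a node once at least `threshold` of its last `window` jobs failed.

    Hypothetical helper for illustration; not part of TORQUE.
    """

    def __init__(self, threshold, window):
        if not 0 < threshold <= window:
            raise ValueError("need 0 < threshold <= window")
        self.threshold = threshold
        self.window = window
        # One bounded history per node; results older than `window` jobs
        # drop off automatically.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, node, exit_code):
        """Record one job result; return True if the node now looks suspicious."""
        self._history[node].append(exit_code != 0)
        return sum(self._history[node]) >= self.threshold


# Example: flag a node when 3 of its last 5 jobs exited non-zero.
tracker = NodeFailureTracker(threshold=3, window=5)
if tracker.record("node17", exit_code=1):
    # Here something privileged would mark the node offline or add a note.
    pass
```

Setting `threshold` equal to `window` gives the simpler "last X jobs all failed" rule. The sketch deliberately only returns a yes/no answer, so whatever actually marks the node offline can run as a different, more privileged user than the job itself.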
Thanks, pete