[torqueusers] Pulling failed nodes based on the launched process' exit code

Martins, Flavio flavio.martins at fttinc.com
Thu Sep 6 08:55:17 MDT 2007


You could probably set up an epilogue script to check the exit status of
the application and mark your nodes. I believe the TORQUE epilogue script
runs as root and should have permission to change node status.
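
A minimal sketch of such an epilogue, written in Perl to match the
wrapper quoted below. It rests on assumptions to check against your
TORQUE version: that the epilogue receives the job id as its first
argument and the job exit code as its tenth, and that "pbsnodes -o"
marks a node offline while "-N" attaches a note. The offline_command
helper is a hypothetical name used only for illustration. Also note the
epilogue runs on the job's head node, so only that node is marked.

```perl
#!/usr/bin/perl
# Sketch of a TORQUE epilogue that offlines this node after a failed job.
use strict;
use warnings;
use Sys::Hostname;

# Build the pbsnodes command for a given exit status, or return an
# empty list when the job succeeded and nothing should be done.
sub offline_command {
    my ($node, $job_id, $exit_status) = @_;
    return () if $exit_status == 0;
    # Assumption: -o marks the node offline and -N attaches a note.
    return ('pbsnodes', '-o', '-N', "job $job_id exited $exit_status", $node);
}

# Assumption: argument 1 is the job id and argument 10 is the job exit
# code. Nothing is run when the script is called without them.
my ($job_id, $exit_status) = @ARGV[0, 9];
if (defined $exit_status) {
    my @cmd = offline_command(hostname(), $job_id, $exit_status);
    system(@cmd) if @cmd;
}
```

Keeping the decision in a small function makes it easy to dry-run the
logic by hand before letting the script touch node state.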

Flavio Martins
Senior Engineer
Aerodynamics / CFD
Florida Turbine Technologies Inc.
100 Marquette Road
Suite 110
Jupiter, FL 33458-7101
Phone: (561) 427-6261
Fax: (561) 427-6191

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Peter Wyckoff
Sent: Wednesday, September 05, 2007 4:21 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Pulling failed nodes based on the launched
process' exit code


Hi,

I am trying to make my installation as robust as possible to poorly
installed software and/or hardware problems.

I've done this by:

1. always selecting a random node as the head node for a job, since I
don't want one bad node to cause a job to fail repeatedly.

2. running the health script on each node

And would like to also:

3. based on the exit code of the process/node, potentially mark a box
offline, or flag it with something like -n "suspicious". In an advanced
world, only do this if the last X jobs failed on this node, or X out of
the last Y.
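
The "X out of Y" rule in item 3 could be sketched roughly as follows.
This is only an illustration: is_suspicious is a hypothetical helper,
and it assumes the caller already keeps a per-node list of recent exit
codes, oldest first (how that history is stored is left open).

```perl
use strict;
use warnings;

# Return true when at least $x of the last $y recorded exit codes are
# non-zero. @$history is ordered oldest to newest.
sub is_suspicious {
    my ($history, $x, $y) = @_;
    # Keep only the most recent $y entries (or all, if there are fewer).
    my @recent = @$history > $y ? @{$history}[-$y .. -1] : @$history;
    my $failures = grep { $_ != 0 } @recent;
    return $failures >= $x;
}

# Example: the last three jobs exited 0, 1, 1, so 2 of 3 failed.
my @codes = (0, 1, 0, 1, 1);
print is_suspicious(\@codes, 2, 3) ? "suspicious\n" : "ok\n";
```

Setting X equal to Y gives the stricter "last X jobs all failed" variant.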

Is there any capability in TORQUE to do #3? I could probably do it with
a wrapper around the process running on each node. Something like:

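#!/usr/local/bin/perl
use strict;
use warnings;

# The first argument is the executable; the rest are its arguments.
my ($executable, @args) = @ARGV;

# Run the program and wait for it; $? then holds its wait status.
system($executable, @args);
if ($? != 0) {

    # mark the node as suspicious or pull it out or ...

}

exit 0;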


But this assumes the user I run it as has permission to mark nodes
offline, which I don't especially like.

Thanks, pete

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


