[torqueusers] Epilogue.parallel scripts

Garrick Staples garrick at usc.edu
Wed Sep 12 11:37:09 MDT 2007


On Wed, Sep 12, 2007 at 10:39:21AM -0700, Peter Wyckoff alleged:
> 
> I notice in the docs that epilogue.parallel scripts don't get the exit code
> of a job run with qsub or pbsdsh in their environment. Is there a way to get
> it other than grepping the mom_logs?

You want to mark nodes offline based on the exit code of the job?  So the next
time someone does 'echo blah blah blah | qsub', nodes get marked offline?

But to answer your question, no.  I don't think sister nodes ever get the exit
value of the job.

You could always do this work from the normal epilogue.
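
A rough sketch of what that could look like, in Python.  It assumes pbs_mom
hands the job's exit status to the epilogue as the tenth argument, and that
negative values mean pbs_mom itself failed to run the job; check both against
the docs for your version.  The function names and the "only negative codes"
policy are made up for illustration, and whether 'pbsnodes -o' is permitted
from the node depends on how your server is configured.

```python
#!/usr/bin/env python3
"""Sketch of a TORQUE epilogue that reacts to the job's exit status."""
import socket
import subprocess
import sys

# Assumption: the job exit status arrives as the 10th epilogue argument.
EXIT_CODE_ARG = 10


def parse_exit_code(argv):
    """Return the job exit status as an int, or None if absent or malformed."""
    if len(argv) <= EXIT_CODE_ARG:
        return None
    try:
        return int(argv[EXIT_CODE_ARG])
    except ValueError:
        return None


def should_offline(exit_code):
    """Hypothetical policy: react only to negative codes.

    Assumption: negative values are set by pbs_mom when it could not run the
    job, so an ordinary failing user command never takes a node out.
    """
    return exit_code is not None and exit_code < 0


def offline_command(node):
    """Build the pbsnodes command that marks a node offline."""
    return ["pbsnodes", "-o", node]


def main():
    code = parse_exit_code(sys.argv)
    if should_offline(code):
        # The local hostname may not match the node name the server knows.
        subprocess.call(offline_command(socket.gethostname()))


if __name__ == "__main__":
    main()
```

Keeping the policy in should_offline() makes it easy to tighten later, for
example to a specific set of codes.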

 
> Also, the script runs as root on a compute node, so it can't run
> 'pbsnodes -o <localhost>' to take a bad machine out.

Why not?
 

> It can run momctl -s, but that isn't as nice as taking it offline. Is there
> another way to do this?

The health check script is really designed for this purpose.
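
For illustration, a minimal health check might look like the sketch below.
As I understand it, the script is named in the mom config with
$node_check_script (and optionally $node_check_interval), and it reports a
problem by printing a line that starts with ERROR; verify that against the
docs for your version.  The scratch directory, the free-space threshold, and
the function names are all hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a pbs_mom health check script."""
import shutil

SCRATCH_DIR = "/tmp"            # hypothetical directory to watch
MIN_FREE_BYTES = 1024 ** 3      # hypothetical threshold: 1 GiB


def check_free_space(path, min_free):
    """Return a problem description, or None if the path has enough space."""
    try:
        free = shutil.disk_usage(path).free
    except OSError as exc:
        return "cannot check %s: %s" % (path, exc)
    if free < min_free:
        return "%s has only %d bytes free" % (path, free)
    return None


def health_report(problems):
    """Return the line to print: empty when healthy, else 'ERROR ...'."""
    found = [p for p in problems if p]
    if not found:
        return ""
    return "ERROR " + "; ".join(found)


def main():
    report = health_report([check_free_space(SCRATCH_DIR, MIN_FREE_BYTES)])
    if report:
        # Assumption: pbs_mom looks for the ERROR keyword on stdout.
        print(report)


if __name__ == "__main__":
    main()
```

More checks can be added to the list passed to health_report(); each one just
returns None when things look fine.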
