[torquedev] Should a communication error between pbs_moms kill a job?

Chris Samuel csamuel at vpac.org
Mon May 18 18:38:24 MDT 2009


----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:

> Exactly. I was thinking along the lines of the same thing. HP-MPI can
> do interconnect failover which is helpful in some circumstances.

I think Open MPI can do something similar; at the very least it
falls back to another interconnect when it finds one down at job
start, for instance if an IB card has failed.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

