[torqueusers] reply code=15001...

Gonzalo Merino merino at pic.es
Sun Oct 28 14:18:29 MDT 2007


Hi,

We also sometimes see jobs fail because, as far as we can tell, pbs_mom 
fails to scp the "input" files from the master node.
We think this may be the race condition described earlier in this 
thread. Our prologue script can sometimes take a while to run, so by the 
time pbs_mom tries to scp the input files, they may already have been 
deleted at the origin, or something similar.

So, can somebody confirm whether this patch is still missing from the 
current maui release? (It looks like an important and fairly old issue...)

Thanks a lot,
Gonzalo

Garrick Staples wrote:
> On Thu, Oct 25, 2007 at 01:12:32PM -0400, nathaniel.x.woody at gsk.com alleged:
>> Huh, to follow up on this, what are the rare Bad Things that can happen 
>> here (I decided years ago to ignore the millions of these we get)? 
> 
> Since maui/moab is temporarily setting the nodes request to the full nodelist
> (replacing it with the original request after the job start), failed job starts
> can leave the job tied to specific nodes instead of simply being retried on
> other nodes.  The worst case is a node going down during the job start leaving
> the job impossible to run.

