[torqueusers] job execution error

Garrick Staples garrick at clusterresources.com
Thu Oct 5 14:10:11 MDT 2006


On Thu, Oct 05, 2006 at 10:42:28AM -0700, Sam Rash alleged:
> So we've got a 'drone script' that we've been running through this torque
> server 10k times a day w/o problems.  Suddenly one node gets this in the
> stderr (.ER file) for a job:
> 
> -bash: line 1: /home/y/var/pbs/mom_priv/jobs/1899889.med.SC: No such file or directory
> 
> Isn't that the script PBS generates for you when you do
> echo my command | qsub?

Yes, and the same happens when you pass a script to qsub.  Both methods
generate a mom_priv/jobs/seq.host.SC file.

 
> Does this simply mean
> 
> 1) it wasn't created somehow? (a new bug in our setup, or a newly
>    exposed bug in PBS?)
> 
> 2) it got deleted somehow?
> 
> 3) we have cluster gnomes who come out at night and do strange things
>    to our boxes?

I'd go with #3 :)

Network glitch?  Full /var or /tmp somewhere?  Filesystem error?
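
The full-filesystem possibility is the easiest of these to rule out on the
affected node. A minimal sketch in Python, assuming a 5% free-space threshold
is a sensible cutoff; the function name and the threshold are illustrative and
not part of TORQUE:

```python
import shutil


def low_space_mounts(paths, min_free_fraction=0.05):
    """Return (path, free_fraction) for every path below the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)
        # Guard against pseudo-filesystems that report a total of zero.
        free_fraction = usage.free / usage.total if usage.total else 0.0
        if free_fraction < min_free_fraction:
            low.append((path, free_fraction))
    return low
```

Calling low_space_mounts(["/var", "/tmp"]) on the node that produced the error
returns an empty list when both filesystems have room to spare.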

If you can come up with a way to reproduce this, I can fix it.

 
> Has anyone else seen this?
> 
> Also, does TORQUE have a feature so that if, say, K jobs have failed on
> node Y within some time span T, it automatically marks the node offline
> and emails the admin?
> 
> (It seems we could write a quick perl hack to do this, but why reinvent..?)

You could do a health check script that checks the mom logs and throws
an error.
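
A rough sketch of such a check in Python, under several assumptions that
should be verified against your own installation: that each mom log line
starts with a "MM/DD/YYYY HH:MM:SS;" timestamp, that a regular expression of
your choosing identifies a failed job in those lines, and that pbs_mom treats
health check output beginning with "ERROR" as a node problem. The function
names, the threshold, and the window are all hypothetical:

```python
import re
from datetime import datetime, timedelta

# Assumed log layout: "MM/DD/YYYY HH:MM:SS;" at the start of each line.
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});")


def count_recent_failures(lines, now, window, pattern):
    """Count lines matching `pattern` with a timestamp inside [now - window, now]."""
    cutoff = now - window
    count = 0
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        if cutoff <= stamp <= now and pattern.search(line):
            count += 1
    return count


def health_message(lines, now, pattern, max_failures=5,
                   window=timedelta(hours=1)):
    """Return an 'ERROR ...' string once K failures fall within span T, else None."""
    failures = count_recent_failures(lines, now, window, pattern)
    if failures >= max_failures:
        return f"ERROR {failures} job failures within {window}"
    return None
```

A wrapper would read the current mom log, call health_message with
datetime.now() and a pattern taken from real failure lines in that log, and
print the result if it is not None. Emailing the admin would be a separate
step in the same wrapper.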

Otherwise, I'm sure Moab could do this.


