[torqueusers] job execution error
Garrick Staples
garrick at clusterresources.com
Thu Oct 5 14:10:11 MDT 2006
On Thu, Oct 05, 2006 at 10:42:28AM -0700, Sam Rash alleged:
> So we've got a 'drone script' that we've been running through this torque
> server 10k times a day w/o problems. Suddenly one node gets this in the
> stderr (.ER file) for a job:
>
>
>
> -bash: line 1: /home/y/var/pbs/mom_priv/jobs/1899889.med.SC: No such file or
> directory
>
>
>
> Isn't that the generated script PBS makes for you when you do echo my
> command | qsub ?
Yes, or when you pass a script file to qsub. Both methods generate a
mom_priv/jobs/seq.host.SC file (here, 1899889.med.SC).
> Does this simply mean
>
> 1) it wasn't created somehow? (newly created bug in our setup, newly
> exposed bug in pbs?)
>
> 2) it got deleted somehow
>
> 3) we have cluster gnomes who come out at night and do strange things
> to our boxes.
I'd go with #3 :)
Network glitch? Full /var or /tmp somewhere? Filesystem error?
If you can come up with a way to reproduce this, I can fix it.
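As a starting point for ruling out the "full /var or /tmp" case, here is a
minimal sketch that could be run on the affected node. The function name
low_space_paths and the 5% threshold are assumptions made for illustration,
not anything TORQUE provides.

```python
import shutil


def low_space_paths(paths, min_free_fraction=0.05, usage=shutil.disk_usage):
    """Return the paths whose filesystem has less free space than the threshold.

    `usage` is injectable so the check can be exercised without real disks.
    """
    low = []
    for path in paths:
        try:
            total, _used, free = usage(path)
        except OSError:
            # Path missing or unreadable; skip it rather than guess.
            continue
        if total > 0 and free / total < min_free_fraction:
            low.append(path)
    return low


if __name__ == "__main__":
    # /var and /tmp are the two locations named in the reply above.
    for path in low_space_paths(["/var", "/tmp"]):
        print("low free space on", path)
```

This only catches a filesystem that is nearly full at the moment it runs; a
transient fill-up or a network glitch would need logging over time.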
> anyone else seen this?
>
>
>
> Also, does torque have a feature that if say K jobs have failed on node Y
> maybe in some time span T, automatically mark it offline and email the
> admin?
>
> (it seems we could write a quick perl hack to do this, but why reinvent..?)
You could write a health check script that scans the mom logs and reports
an error once it sees too many failures.
Otherwise, I'm sure moab could do this.
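A rough sketch of the health check idea, counting K failures within a time
span T, might look like the following. Several things here are assumptions
to adapt: the log line timestamp format (MM/DD/YYYY HH:MM:SS followed by a
semicolon), the log location under /home/y/var/pbs/mom_logs/, the FAILURE
pattern (a placeholder, since the exact message pbs_mom logs for this
failure is not known from this thread), and the thresholds of 5 failures
per hour. As far as I recall, pbs_mom can run such a script through the
$node_check_script setting in mom_priv/config and treats output beginning
with ERROR as a node problem; check the documentation for your version.

```python
import re
import sys
import time
from datetime import datetime, timedelta

# Assumption: mom log lines begin with "MM/DD/YYYY HH:MM:SS;".
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});")
# Placeholder pattern; replace with whatever your mom logs on a failed job.
FAILURE = re.compile(r"No such file or directory")


def count_recent_failures(lines, now, window, pattern=FAILURE):
    """Count lines matching `pattern` whose timestamp is within `window` of `now`."""
    cutoff = now - window
    count = 0
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match or not pattern.search(line):
            continue
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        if cutoff <= stamp <= now:
            count += 1
    return count


def health_report(count, threshold):
    """Return an ERROR line when the threshold is reached, else an empty string."""
    if count >= threshold:
        return "ERROR %d job failures in mom log" % count
    return ""


def main(argv):
    # Assumed default: one log file per day, named YYYYMMDD.
    default = time.strftime("/home/y/var/pbs/mom_logs/%Y%m%d")
    path = argv[1] if len(argv) > 1 else default
    try:
        with open(path, errors="replace") as log:
            count = count_recent_failures(log, datetime.now(), timedelta(hours=1))
    except OSError:
        return  # no readable log, so nothing to report
    report = health_report(count, threshold=5)
    if report:
        print(report)


if __name__ == "__main__":
    main(sys.argv)
```

The script only prints the error; marking the node offline and mailing the
admin would still have to be done by whatever consumes that message.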