[torqueusers] job execution error
Sam Rash
srash at yahoo-inc.com
Thu Oct 5 11:42:28 MDT 2006
So we've got a 'drone script' that we've been running through this torque
server 10k times a day w/o problems. Suddenly one node gets this in the
stderr (.ER file) for a job:
-bash: line 1: /home/y/var/pbs/mom_priv/jobs/1899889.med.SC: No such file or directory
Isn't that the generated script PBS makes for you when you do echo my
command | qsub ?
Does this simply mean
1) it wasn't created somehow? (newly created bug in our setup, newly
exposed bug in pbs?)
2) it got deleted somehow
3) we have cluster gnomes who come out at night and do strange things
to our boxes.
anyone else seen this?
Also, does torque have a feature that if say K jobs have failed on node Y
maybe in some time span T, automatically mark it offline and email the
admin?
(it seems we could write a quick perl hack to do this, but why reinvent..?)
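
For illustration, here is a minimal sketch of the kind of quick hack described above, written in Python rather than Perl. It only covers the "K failures on node Y within time span T" bookkeeping. The class name NodeFailureTracker, the helper offline_command, and the way failures are fed in (one call per failed job with a node name and a timestamp in seconds) are all assumptions made for this example, not Torque features. The only Torque piece it relies on is `pbsnodes -o <node>`, which marks a node offline; the sketch builds that command but does not run it, and it leaves out emailing the admin.

```python
from collections import defaultdict, deque


class NodeFailureTracker:
    """Hypothetical helper: counts recent job failures per node."""

    def __init__(self, max_failures, window_seconds):
        self.max_failures = max_failures      # K in the question above
        self.window_seconds = window_seconds  # T in the question above
        self._failures = defaultdict(deque)   # node name -> failure timestamps

    def record_failure(self, node, timestamp):
        """Record one failed job on `node` at `timestamp` (seconds).

        Returns True when the node has had at least K failures
        within the last T seconds, counting this one.
        """
        times = self._failures[node]
        times.append(timestamp)
        # Forget failures that fell out of the time window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) >= self.max_failures


def offline_command(node):
    """Build (but do not run) the command that marks a node offline."""
    return ["pbsnodes", "-o", node]


if __name__ == "__main__":
    # Example: 3 failures within 10 minutes trips the threshold.
    tracker = NodeFailureTracker(max_failures=3, window_seconds=600)
    for t in (0, 100, 200):
        if tracker.record_failure("node17", t):
            print("would run:", " ".join(offline_command("node17")))
```

How the failures get detected in the first place (scanning .ER files, job exit codes, accounting logs) is left open here, since that depends on the site setup.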
Thanks,
Sam Rash
srash at yahoo-inc.com
408-349-7312
vertigosr37