[torqueusers] job execution error

Sam Rash srash at yahoo-inc.com
Thu Oct 5 11:42:28 MDT 2006


We've got a 'drone script' that we've been running through this torque
server 10k times a day without problems.  Suddenly one node gets this in the
stderr (.ER file) for a job:

-bash: line 1: /home/y/var/pbs/mom_priv/jobs/1899889.med.SC: No such file or directory

 

Isn't that the script PBS generates for you when you run
'echo my command | qsub'?

Does this simply mean that

1) it was never created (a new bug in our setup, or a newly exposed bug
   in PBS?),

2) it got deleted somehow, or

3) we have cluster gnomes who come out at night and do strange things to
   our boxes?

Has anyone else seen this?

 

Also, does torque have a feature that automatically marks a node offline
and emails the admin once K jobs have failed on node Y within some time
span T?

(It seems we could write a quick perl hack to do this, but why reinvent..?)
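
For illustration, a minimal sketch of what such a hack might look like,
written in Python rather than Perl. It only covers the "K failures on one
node within T seconds" bookkeeping. The names NodeFailureWatch and
offline_and_mail are hypothetical, how job failures get detected and fed in
is not shown, and the action assumes that 'pbsnodes -o <node>' is available
to mark the node offline and that a local 'mail' command exists.

```python
import subprocess
import time
from collections import defaultdict, deque


class NodeFailureWatch:
    """Trip once per node when K failures land inside a T-second window."""

    def __init__(self, max_failures, window_seconds, on_trip):
        self.max_failures = max_failures      # K
        self.window_seconds = window_seconds  # T
        self.on_trip = on_trip                # callback: on_trip(node, count)
        self.failures = defaultdict(deque)    # node -> recent failure times
        self.tripped = set()                  # nodes already reported

    def record_failure(self, node, when=None):
        """Record one failed job on `node`; return True if this call tripped it."""
        when = time.time() if when is None else when
        recent = self.failures[node]
        recent.append(when)
        # Drop failures that have aged out of the window.
        while recent and when - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_failures and node not in self.tripped:
            self.tripped.add(node)
            self.on_trip(node, len(recent))
            return True
        return False


def offline_and_mail(node, count, admin="root"):
    """Hypothetical action: mark the node offline, then mail the admin."""
    # Assumption: pbsnodes is on PATH and we have rights to offline nodes.
    subprocess.run(["pbsnodes", "-o", node], check=True)
    body = f"{node} marked offline after {count} job failures\n"
    # Assumption: a local `mail` command is installed and configured.
    subprocess.run(["mail", "-s", f"torque: {node} offline", admin],
                   input=body, text=True, check=True)
```

Something like NodeFailureWatch(5, 600, offline_and_mail) would then take a
node offline after 5 failures within 10 minutes, with record_failure(node)
called from whatever notices a failed job. Keeping the action as a callback
makes it possible to try out the counting logic without touching the
cluster.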

 

 

Thanks,

 

Sam Rash

srash at yahoo-inc.com

408-349-7312

vertigosr37

 
