[torqueusers] fault tolerance

Alexander Saydakov saydakov at yahoo-inc.com
Thu Aug 10 11:33:11 MDT 2006



I saw the following situation on our cluster: one node was rebooted, and
after it came back online it reported that all three jobs that had been
running on it before the reboot had finished successfully.


So my questions are:


1.	Suppose a node goes down forever, and suppose that the application
which submitted a job to it runs 'qstat -f' periodically to find out the
job's status (with keep_completed set). When, if ever, will Torque report
the job as failed?
2.	Suppose the node comes back after a while (before Torque gives up
waiting). Shouldn't pbs_mom report all jobs as failed if they are indeed


Note: we start pbs_mom with the -p option, which we understand to mean
that running jobs (if any) are preserved across a restart.
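
For reference, the periodic status check described in question 1 could be
sketched roughly as below. This is only an illustration, not our actual
code: the function names are made up, and it assumes that 'qstat -f' prints
"name = value" attribute lines (with wrapped values continued on
tab-indented lines) and that a completed job kept by keep_completed shows
job_state = C together with an exit_status attribute.

```python
import subprocess


def parse_qstat_full(text):
    """Parse 'qstat -f' style output for one job into a dict of attributes."""
    attrs = {}
    key = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            key = None
        elif line.startswith("\t") and key is not None:
            # Assumed: long values are wrapped onto tab-indented lines.
            attrs[key] += line.strip()
        elif " = " in line:
            name, value = line.split(" = ", 1)
            key = name.strip()
            attrs[key] = value.strip()
    return attrs


def job_outcome(attrs):
    """Classify a job as 'pending', 'ok', 'failed' or 'unknown'."""
    if attrs.get("job_state") != "C":
        return "pending"  # queued, running, or state not reported
    status = attrs.get("exit_status")
    if status is None:
        return "unknown"  # completed, but no exit status was recorded
    return "ok" if status == "0" else "failed"


def poll(job_id):
    """Run 'qstat -f <job_id>' once and classify the result."""
    result = subprocess.run(
        ["qstat", "-f", job_id],
        capture_output=True, text=True, check=True,
    )
    return job_outcome(parse_qstat_full(result.stdout))
```

A check like this only repeats what the server says, so it cannot tell a
job that really finished from one that was lost in the reboot but reported
as finished successfully.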



