[torqueusers] fault tolerance

Garrick Staples garrick at clusterresources.com
Thu Aug 10 12:08:30 MDT 2006

On Thu, Aug 10, 2006 at 10:33:11AM -0700, Alexander Saydakov alleged:
> Hi!
> I saw the following situation on our cluster: one node was rebooted, and
> after it came back online it reported that all three jobs that had been
> running on it before the reboot had finished successfully.
> So my questions are:
> 1.	Suppose a node goes down forever. And suppose that the application
> which submitted the job runs 'qstat -f' periodically to find out its status
> (with keep_completed set). When, if ever, will Torque report the job as failed?

If it was a sister node, the job goes through a normal job exit once it
is killed or reaches a limit.

If it was the job's MS (mother superior), the job waits forever for the
node to come back. If the MS is truly gone forever, then the admin should
use 'qdel -p' to purge the job from pbs_server.
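
Since a job whose MS is gone never changes state by itself, the submitting
application may want its own deadline when polling. What follows is only a
rough sketch of that idea, not anything Torque ships. It assumes 'qstat -f'
prints attributes as "name = value" lines and that a completed job kept by
keep_completed shows "job_state = C" together with an "exit_status" line.
The function names, the verdict strings, and the timeout values are made up
for illustration.

```python
import subprocess
import time


def parse_qstat_full(text):
    """Collect 'name = value' attribute lines from `qstat -f` output."""
    attrs = {}
    for line in text.splitlines():
        # Assumption: wrapped continuation lines begin with a tab; skip them.
        if line.startswith("\t") or " = " not in line:
            continue
        key, value = line.strip().split(" = ", 1)
        attrs[key] = value
    return attrs


def classify(attrs, waited_s, max_wait_s):
    """Turn job attributes plus elapsed time into a coarse verdict."""
    if attrs.get("job_state") == "C":
        status = attrs.get("exit_status")
        if status is None:
            return "unknown"
        # Caveat: an exit status of 0 can be misleading if pbs_mom was
        # restarted with -p after the job's processes had disappeared.
        return "succeeded" if status == "0" else "failed"
    if waited_s >= max_wait_s:
        # The server may wait forever for a dead MS, so give up locally.
        return "presumed_lost"
    return "pending"


def poll_job(job_id, max_wait_s=3600, interval_s=60):
    """Poll `qstat -f <job_id>` until a verdict other than 'pending'."""
    waited = 0
    while True:
        result = subprocess.run(
            ["qstat", "-f", job_id], capture_output=True, text=True
        )
        verdict = classify(parse_qstat_full(result.stdout), waited, max_wait_s)
        if verdict != "pending":
            return verdict
        time.sleep(interval_s)
        waited += interval_s
```

A "presumed_lost" verdict only means the application stopped waiting. The
job itself stays in pbs_server until an admin purges it with 'qdel -p'.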

> 2.	Suppose that node comes back after a while (before Torque gives up
> waiting), shouldn't pbs_mom report all jobs as failed if they are indeed
> gone?

pbs_mom doesn't "know" the node has rebooted.  Its action depends on
the command-line args.

> Note: we start pbs_mom with -p option, which we understood as to keep
> running jobs (if any).

Don't use -p or -r on boot.  -p tells pbs_mom to preserve jobs, even if
the processes aren't its children, which means that the exit status can't
be known.  Since the process is gone, the job is assumed to have exited
successfully.
