[torqueusers] fault tolerance
garrick at clusterresources.com
Thu Aug 10 15:31:15 MDT 2006
This odd quoting is really hard to read!
On Thu, Aug 10, 2006 at 02:08:45PM -0700, Alexander Saydakov alleged:
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
> Sent: Thursday, August 10, 2006 11:09 AM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] fault tolerance
> If it was the job's MS, the job waits forever for the node to come back.
> If the MS is truly gone forever, then the admin should use 'qdel -p' to
> purge the job from pbs_server.
> This does not sound right to me. Can we use an enforced walltime limit to
> work around it?
No, if MS is down, the job will not exit.
We have plans to correct this, but presently, jobs will never exit while MS
is down.
> > 2. Suppose that node comes back after a while (before Torque gives up
> > waiting), shouldn't pbs_mom report all jobs as failed if they are indeed
> > gone?
> pbs_mom doesn't "know" the node has rebooted. Its action depends on
> the command-line args.
> Yes, it does not know about the reboot, but it should be able to track jobs
> somehow and realize that those pids are gone or belong to a different
Right, which is why the job exits after pbs_mom starts: the PIDs aren't
around anymore (which is a pretty shaky assumption to make after a
reboot because PIDs might be reused).
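To illustrate why a bare "does this PID exist?" check is shaky, here is a
minimal sketch of a stricter check that compares the PID's start time with
a value recorded when the job was launched. This is not how pbs_mom
works; the helper names (read_start_ticks, job_process_alive) are
hypothetical, and the /proc parsing assumes Linux. Start times are counted
in clock ticks since boot, so a real tracker would probably also need to
record something that identifies the boot itself.

```python
import os


def read_start_ticks(pid):
    """Return the start time of `pid` in clock ticks since boot (Linux /proc),
    or None if no such process exists. Hypothetical helper for illustration."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split after the LAST ')' before tokenizing.
    fields = stat.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field 22 (starttime) is fields[19].
    return int(fields[19])


def job_process_alive(pid, recorded_start, lookup=read_start_ticks):
    """True only if `pid` exists AND has the start time recorded at launch.
    A reused PID shows a different start time and is treated as gone."""
    current = lookup(pid)
    return current is not None and current == recorded_start
```

A tracker would store the (pid, start time) pair when it launches the job
and call job_process_alive(pid, start) later. Checking the PID alone cannot
tell the original process from an unrelated one that was handed the same
number.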
> > Note: we start pbs_mom with -p option, which we understood as to keep
> > running jobs (if any).
> Don't use -p or -r on boot. -p tells pbs_mom to preserve jobs, even if
> the processes aren't a child, which means that exit status can't be known.
> Since the process is gone, the job is assumed successfully
> Are you saying that without -p it would do the right thing - report those
> lost jobs as failed?
I haven't verified whether it is reported as error, but I believe so,
Note that the pbs_mom manpage has details on -r, -p, and that no option
should be used at boot. Also, see the sample initscripts in