[torqueusers] fault tolerance

'Garrick Staples' garrick at clusterresources.com
Thu Aug 10 15:31:15 MDT 2006


This odd quoting is really hard to read!

On Thu, Aug 10, 2006 at 02:08:45PM -0700, Alexander Saydakov alleged:
> 		-----Original Message-----
> 		From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
> 		Sent: Thursday, August 10, 2006 11:09 AM
> 		To: torqueusers at supercluster.org
> 		Subject: Re: [torqueusers] fault tolerance
> 
> 		If it was the job's MS, the job waits forever for the node to come back.
> 		If the MS is truly gone forever, then the admin should use 'qdel -p' to
> 		purge the job from pbs_server.
> 		
> 
> This does not sound right to me. Can we use an enforced walltime limit to
> work around it?

No. Walltime limit or not, if the MS is down, the job will not exit.

We have plans to correct this, but presently, jobs will never exit while MS
is down. 
 		
 		
> 		> 2.	Suppose that node comes back after a while (before Torque gives up
> 		> waiting), shouldn't pbs_mom report all jobs as failed if they are indeed
> 		> gone?
> 
> 		pbs_mom doesn't "know" the node has rebooted.  Its action depends on
> 		the command-line args.
> 		
> Yes, it does not know about the reboot, but it should be able to track jobs
> somehow and realize that those pids are gone or belong to a different
> process.

Right, which is why the job exits after pbs_mom starts: the PIDs aren't
around anymore. (That is a pretty shaky assumption to make after a
reboot, because PIDs might be reused.)
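
To illustrate why a bare "is this PID still around?" check is shaky, here
is a minimal Python sketch. It is not how pbs_mom tracks jobs; the function
names are made up for illustration, and it assumes the Linux
/proc/<pid>/stat layout, where field 22 is the process start time in clock
ticks since boot. The idea is to record the start time alongside the PID
and treat the process as the same one only if both still match.

```python
from typing import Optional


def read_stat(pid: int) -> Optional[str]:
    """Return the contents of /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return f.read()
    except OSError:
        # No such process (or no /proc at all on this system).
        return None


def parse_start_time(stat_line: str) -> int:
    """Extract the starttime field (field 22) from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenizing.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # 'rest' begins at field 3 (state), so field 22 is at index 22 - 3 = 19.
    return int(fields[19])


def is_same_process(recorded_start: int, stat_line: Optional[str]) -> bool:
    """True only if the PID exists AND started when we recorded it."""
    if stat_line is None:
        return False  # PID is not around anymore
    return parse_start_time(stat_line) == recorded_start
```

A caller would save parse_start_time(read_stat(pid)) when the job starts
and later call is_same_process(saved, read_stat(pid)). Since the start time
is counted from boot, a check meant to survive a reboot would also have to
record something that identifies the boot itself; that part is left out
here.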

		
> 		> Note: we start pbs_mom with the -p option, which we understood
> 		> to mean keeping running jobs (if any).
> 
> 		Don't use -p or -r on boot.  -p tells pbs_mom to preserve jobs, even if
> 		the processes aren't its children, which means that exit status can't be known.
> 		Since the process is gone, the job is assumed to have exited successfully.
> 
> 		
> Are you saying that without -p it would do the right thing - report those
> lost jobs as failed?

I haven't verified whether it is reported as an error, but I believe so,
yes.

Note that the pbs_mom manpage has details on -r and -p, and says that no
option should be used at boot.  Also, see the sample initscripts in
contrib/init.d/.



