[torqueusers] fault tolerance

Alexander Saydakov saydakov at yahoo-inc.com
Thu Aug 10 15:08:45 MDT 2006


		-----Original Message-----
		From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
		Sent: Thursday, August 10, 2006 11:09 AM
		To: torqueusers at supercluster.org
		Subject: Re: [torqueusers] fault tolerance

		If it was the job's MS, the job waits forever for the node to come back.
		If the MS is truly gone forever, then the admin should use 'qdel -p' to
		purge the job from pbs_server.
		

This does not sound right to me. Can we use an enforced walltime limit to
work around it?
		
		
		> 2.	Suppose that node comes back after a while (before Torque gives up
		> waiting), shouldn't pbs_mom report all jobs as failed if they are indeed
		> gone?

		pbs_mom doesn't "know" the node has rebooted.  Its action depends on
		the command-line args.
		
Yes, it does not know about the reboot, but it should be able to track jobs
somehow and realize that those pids are gone or belong to a different
process.
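
To illustrate the kind of check I have in mind, here is a rough sketch. It is not how pbs_mom works today; it assumes a Linux-style /proc and assumes the start time of each job process was recorded when the job was launched. The function names and the proc_root parameter are made up for the example.

```python
import os


def read_start_time(pid, proc_root="/proc"):
    """Return the start time of pid in clock ticks since boot, or None if the pid is gone."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the last ')'.
    fields = data.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field 22 (starttime) is fields[19].
    return int(fields[19])


def job_process_state(pid, recorded_start, proc_root="/proc"):
    """Classify a recorded job pid as 'alive', 'gone', or 'reused'."""
    current = read_start_time(pid, proc_root)
    if current is None:
        return "gone"      # no such process any more
    if current != recorded_start:
        return "reused"    # same pid, but a different process owns it now
    return "alive"
```

After a reboot the recorded start times would no longer match anything, so every old job would come out as "gone" or "reused" and could be reported as failed.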

		
		> Note: we start pbs_mom with the -p option, which we understood to mean
		> keep running jobs (if any).

		Don't use -p or -r on boot.  -p tells pbs_mom to preserve jobs, even if
		the processes aren't its children, which means that exit status can't be known.
		Since the process is gone, the job is assumed to have exited successfully.

		
Are you saying that without -p it would do the right thing - report those
lost jobs as failed?

Thanks a lot.
