[torqueusers] fault tolerance
saydakov at yahoo-inc.com
Thu Aug 10 15:08:45 MDT 2006
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Thursday, August 10, 2006 11:09 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] fault tolerance
If it was the job's MS, the job waits forever for the node to come back.
If the MS is truly gone forever, then the admin should use 'qdel -p' to
purge the job from pbs_server.
This does not sound right to me. Can we use an enforced walltime limit to
work around it?
> 2. Suppose that node comes back after a while (before Torque gives up
> waiting), shouldn't pbs_mom report all jobs as failed if they are indeed
pbs_mom doesn't "know" the node has rebooted. Its action is determined by
the command-line args.
Yes, it does not know about the reboot, but it should be able to track jobs
somehow and realize that those pids are gone or belong to a different
process.
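
To make the idea concrete, here is a rough sketch of the kind of check I
have in mind. It is not how pbs_mom works today; it assumes Linux, and the
function names, the stored boot time, and the tolerance are all made up
for illustration. The idea is to record the kernel boot time along with
each job's pid, then treat the job as lost if the boot time has changed
or the pid no longer exists.

```python
import os


def read_boot_time(stat_path="/proc/stat"):
    """Return the kernel boot time (seconds since the epoch) on Linux."""
    with open(stat_path) as f:
        for line in f:
            if line.startswith("btime "):
                return int(line.split()[1])
    raise RuntimeError("no btime line found in " + stat_path)


def pid_exists(pid):
    """Return True if a process with this pid currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def job_is_lost(recorded_pid, recorded_boot_time, current_boot_time,
                exists=pid_exists, tolerance=1):
    """Hypothetical check: was the job's process lost?

    The job counts as lost if the node rebooted since the pid was recorded
    (any surviving pid with that number is a different process) or if the
    pid no longer exists. `tolerance` (seconds) is an assumed allowance for
    small shifts in the reported boot time.
    """
    if abs(current_boot_time - recorded_boot_time) > tolerance:
        return True
    return not exists(recorded_pid)
```

A daemon doing this would save read_boot_time() next to the pid when the
job starts and call job_is_lost() with a fresh read_boot_time() when it
restarts. A pid reused within the same boot would still fool this check;
comparing process start times would be needed to catch that case.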
> Note: we start pbs_mom with -p option, which we understood as to keep
> running jobs (if any).
Don't use -p or -r on boot. -p tells pbs_mom to preserve jobs, even if
the processes aren't a child, which means that exit status can't be known.
Since the process is gone, the job is assumed successfully completed.
Are you saying that without -p it would do the right thing - report those
lost jobs as failed?
Thanks a lot.