[torqueusers] Rerunable jobs for diskless nodes?
vgregorio at penguincomputing.com
Fri Apr 17 10:55:06 MDT 2009
Thanks for the reply. My comments are inline...
On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
> I'm not using HA on our grid but I thought I would add that I can
> restart torque/maui on the server and it doesn't affect the running jobs
> on the nodes. I would like to think HA would operate in the same fashion.
In my configuration, the compute nodes reboot during the pbs_server
failover. Since the mom_priv folder is on persistent disk storage, I
expected rerunable jobs in the Running state to be requeued.
> As for re-runnable jobs, here's my take on it. A job that can be re-run is
> one that can be started over from the beginning. Meaning the program you're
> using can't do checkpointing, so if it failed part way through the only
> thing you can do is start it over. At least by making the job re-runnable
> you don't have to manually re-queue the job and lose your place in the
> queue.
This is what I expected as well, but I have not been able to create a
scenario where a job restarts from the beginning. Should running,
rerunable jobs restart if compute nodes are rebooted during a pbs_server
failover?
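For reference, this is how I am marking the test jobs rerunable at
submission time (job script name is just a placeholder):

```shell
# Submit a job explicitly marked rerunable (-r y); job.sh is a
# placeholder for the actual job script.
qsub -r y job.sh

# Confirm the Rerunable attribute on the queued job (substitute the
# job id reported by qsub).
qstat -f <jobid> | grep Rerunable
```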
> However, some types of jobs can be set up to pick up where they left off
> if the job failed for some reason. This is not re-runnable since usually
> the user needs to change some of the files around for the job to make sure
> it picks up where it left off running from before. You wouldn't want this
> job to re-run since it would start over from the beginning and you'd lose
> all the computations that had already been done up to that point.
> I could be wrong too, since some programs might be smart enough to pick
> up where they left off when re-run, without user intervention. In that
> case it might be a re-runnable job, since it actually can continue on
> from where it last left off. I hope this helps,
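The "picks up where it left off" case Steve describes can be sketched as
a job script that checks for a checkpoint file before deciding how to
start. This is only an illustration; "my_solver" and the checkpoint path
are hypothetical placeholders, not anything Torque provides:

```shell
#!/bin/sh
# Sketch of a self-resuming PBS job body (hypothetical application).

# Decide whether to resume, based on whether a checkpoint file exists.
run_mode() {
    if [ -f "$1" ]; then
        echo resume
    else
        echo fresh
    fi
}

# PBS_O_WORKDIR is set by pbs_mom for batch jobs; fall back to "." so
# the script can also be exercised outside of PBS.
CKPT="${PBS_O_WORKDIR:-.}/checkpoint.dat"

case "$(run_mode "$CKPT")" in
    resume) echo "resuming from $CKPT" ;;   # e.g. my_solver --resume "$CKPT"
    fresh)  echo "starting from scratch" ;; # e.g. my_solver --checkpoint "$CKPT"
esac
```

A job like this is safe to mark rerunable, since a re-run after failure
continues from the last checkpoint instead of redoing finished work.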
> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
>> Hey folks,
>> I configured the compute nodes to use persistent, disk storage for the
>> mom_priv folder.
>> Now, when the failover from primary to secondary pbs_server occurs,
>> running jobs marked rerunable do not abort. As hoped, they show up in
>> the Running state, then Exit and Complete. But, when you look at the
>> output, the job does not complete its task. The job output is
>> truncated at the point of pbs_server failover.
>> I expected these rerunable jobs to be requeued when the failover
>> occurred. When do jobs get rerun? Or am I misunderstanding what
>> rerunable jobs are?
>> Thank you,
>> Victor Gregorio
>> Penguin Computing
>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>>> Hello all,
>>> I have a Torque failover configuration using CentOS' heartbeat package.
>>> When the failover from the primary to the secondary pbs_server system
>>> happens, all running jobs marked rerunable try to restart on the
>>> secondary system.
>>> The problem is that the diskless compute nodes get re-provisioned,
>>> destroying all pbs_mom information about the running jobs. So, when
>>> rerunable jobs try to run, they are aborted with a "Job does not
>>> exist" error.
>>> Is there a way for the pbs_server to resubmit a rerunable job using
>>> the job information that is kept on the pbs_server? I have tried
>>> tweaking the -t option to 'hot' as well as changing mom_job_sync to
>>> False, but no luck. Any advice is appreciated.
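For anyone reproducing this, these are the two knobs mentioned above;
both are standard Torque settings, though neither helped in my setup:

```shell
# Start pbs_server with a "hot" restart, which attempts to restart jobs
# that were running when the server went down.
pbs_server -t hot

# Disable the server-side mom/job sync check, so the server does not
# purge jobs that the moms no longer know about after re-provisioning.
qmgr -c "set server mom_job_sync = False"
```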
>>> Victor Gregorio
>>> Penguin Computing
>>> torqueusers mailing list
>>> torqueusers at supercluster.org