[torqueusers] Rerunable jobs for diskless nodes?
chemadm at hamilton.edu
Fri Apr 17 07:54:23 MDT 2009
I'm not using HA on our grid, but I thought I would add that I can
restart torque/maui on the server and it doesn't affect the running
jobs on the nodes. I would like to think HA would operate the same way.
As for re-runnable jobs, here's my take on it. A job that can be re-run
is one that can be started over from the beginning. That is, the
program you're using can't do checkpointing, so if it fails part way
through, the only thing you can do is start it over. At least by making
such a job re-runnable you don't have to manually re-queue it and
lose your place in the queue.
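For what it's worth, whether a job may be re-run is controlled at
submission time with qsub's -r option (the script name below is just a
placeholder):

```shell
# Mark a job as rerunnable at submission time (-r y).
# If pbs_server requeues it after a failure, it restarts from the beginning.
qsub -r y myjob.sh

# Conversely, -r n tells the server never to requeue the job automatically.
qsub -r n myjob.sh
```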
However, some types of jobs can be set up to pick up where they left
off if the job fails for some reason. Such a job is not re-runnable,
since usually the user needs to rearrange some of the files to make
sure it resumes from where it stopped. You wouldn't want this job to
re-run automatically, since it would start over from the beginning and
you'd lose all the computation that had already been done up to that
point.
I could be wrong, too, since some programs might be smart enough to
pick up where they left off when re-run, without user intervention. In
that case the job might well be re-runnable, since it actually can
continue from where it last left off. I hope this helps,
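As a concrete sketch of that last case, here is a hypothetical job
script that resumes from a checkpoint file on a blind re-run. The file
name, step count, and "work" loop are all made up for illustration; a
real application would save and restore its own state:

```shell
#!/bin/sh
# Hypothetical job script: resume from a checkpoint file if one exists.
# CKPT and the step counter are assumptions for illustration only.
CKPT=checkpoint.dat
TOTAL=10

# Start from the last saved step, or from 0 on a fresh run.
if [ -f "$CKPT" ]; then
    start=$(cat "$CKPT")
else
    start=0
fi

step=$start
while [ "$step" -lt "$TOTAL" ]; do
    step=$((step + 1))
    # ... one unit of real work would go here ...
    echo "$step" > "$CKPT"   # save progress after every step
done
echo "finished at step $step"
```

If pbs_server requeues such a job after a failover, the second run reads
checkpoint.dat and skips the steps already completed instead of starting
over from zero.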
On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
> Hey folks,
> I configured the compute nodes to use persistent, disk storage for the
> mom_priv folder.
> Now, when the failover from primary to secondary pbs_server occurs,
> running jobs marked rerunable do not abort. As hoped, they show up in
> the Running state, then Exit and Complete. But, when you look at the
> output, the job does not complete its task. The job output is
> truncated at the point of pbs_server failover.
> I expected these rerunable jobs to be requeued when the failover
> occurred. When do jobs get rerun? Or am I misunderstanding what
> rerunable jobs are?
> Thank you,
> Victor Gregorio
> Penguin Computing
> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>> Hello all,
>> I have a Torque failover configuration using CentOS' heartbeat.
>> When the failover from the primary to the secondary pbs_server system
>> happens, all running jobs marked rerunable try to restart on the
>> secondary system.
>> The problem is that the diskless compute nodes get re-provisioned,
>> destroying all pbs_mom information about the running jobs. So, when
>> the rerunable jobs try to run, they are aborted with "Job does not
>> exist on ..."
>> Is there a way for the pbs_server to resubmit a rerunable job using
>> the job information that is kept on the pbs_server? I have tried
>> tweaking the -t option to 'hot' as well as changing mom_job_sync to
>> False, but no luck. Any advice is appreciated.
>> Victor Gregorio
>> Penguin Computing
>> torqueusers mailing list
>> torqueusers at supercluster.org