[torqueusers] Rerunable jobs for diskless nodes?

Steve Young chemadm at hamilton.edu
Fri Apr 17 11:18:12 MDT 2009

Comments below.

On Apr 17, 2009, at 12:55 PM, Victor Gregorio wrote:

> Hey Steve,
> Thanks for the reply. My comments are inline...
> On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
>> Hi,
>> 	I'm not using HA on our grid, but I thought I would add that I can
>> restart torque/maui on the server and it doesn't affect the running jobs
>> on the nodes. I would like to think HA would operate in the same fashion.
> In my configuration, the compute nodes reboot during the pbs_server
> failover.  Since the mom_priv folder is on persistent disk storage, I
> expected rerunable jobs in the Running state to be requeued.

Yeah, if the pbs_moms restart, your job is done. I'd expect that they
*should* get re-run, and in fact I believe our grid works like this
(but I'm not certain; I haven't fully tested it, and I'm in need of a
torque upgrade soon anyhow). I do know that after a recent power outage
and a bad UPS, some nodes were rebooted and those jobs did restart. The
troubling thing is that I'm not sure why, since the user didn't specify
this in their batch script. I also know that another user's jobs did
not restart.
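
One way to see how a given job was flagged, as a rough sketch only: I believe
qstat -f prints a Rerunable attribute for each job, so a small filter can pull
it out. The helper name rerunable_flag and the job id in the usage comment are
made up for illustration, and the exact attribute layout is an assumption to
check against your torque version.

```shell
#!/bin/sh
# Hypothetical helper: read `qstat -f` output on stdin and print the value
# of the Rerunable attribute (assumed to look like "Rerunable = True").
rerunable_flag() {
    sed -n 's/^[[:space:]]*Rerunable[[:space:]]*=[[:space:]]*//p'
}

# Usage on a live system (the job id is a placeholder):
#   qstat -f 1234.headnode | rerunable_flag
```

Comparing the flag on a job that restarted with one that did not would at
least show whether the difference is in the job attributes.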

>> 	As for re-runnable jobs, here's my take on it. A job that can be
>> re-run is one that can be started over from the beginning. Meaning the
>> program you're using can't do checkpointing, so if it failed part way
>> through, the only thing you can do is start it over. At least by making
>> this job re-runnable you don't have to manually re-queue the job and
>> lose your place in the queue.
> This is what I expected as well, but have not been able to create a
> scenario where a job restarts from the beginning.  Should running,
> rerunable jobs restart if compute nodes are rebooted during a
> pbs_server failover?

I would think yes, especially if you declare #PBS -r y.
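
For what it's worth, here is a minimal sketch of the kind of batch script I
have in mind, assuming the job can safely start over from scratch. The job
name, the output file name and the three-step loop are placeholders I made up;
only the -r y directive and PBS_O_WORKDIR come from torque itself.

```shell
#!/bin/sh
#PBS -N rerun_demo
#PBS -r y
# -r y declares the job rerunable, so the server may requeue it and start
# it again from the beginning.

# Torque sets PBS_O_WORKDIR to the submission directory; fall back to the
# current directory so the script can also be run by hand.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Hypothetical output file name, chosen only for this sketch.
OUT="rerun_demo.out"

# A rerun starts over, so discard any partial output left behind by an
# interrupted earlier attempt.
: > "$OUT"

# Placeholder for the real computation.
for step in 1 2 3; do
    echo "step $step" >> "$OUT"
done
echo "done" >> "$OUT"
```

Truncating the output at the top is a design choice for this sketch: it keeps
a rerun from appending to a half-written file, which would otherwise look like
the truncated output you described.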

> Regards,
> -- 
> Victor Gregorio
> Penguin Computing
>> 	However, some types of jobs can be set up to pick up where they left
>> off if the job failed for some reason. This is not re-runnable, since
>> usually the user needs to change some of the files around for the job to
>> make sure it picks up where it left off. You wouldn't want this job to
>> re-run, since it would start over from the beginning and you'd lose all
>> the computations that had already been done up to that point.
>> 	I could be wrong too, since some programs might be smart enough to
>> know how to pick up where they left off when re-run, without user
>> intervention. In this case it might be a re-runnable job, since it
>> actually can continue on from where it last left off. I hope this helps,
>> -Steve
>> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
>>> Hey folks,
>>> I configured the compute nodes to use persistent disk storage for the
>>> mom_priv folder.
>>> Now, when the failover from primary to secondary pbs_server occurs,
>>> running jobs marked rerunable do not abort.  As hoped, they show up in
>>> the Running state, then Exit and Complete.  But, when you look at the
>>> output, the job does not complete its task.  The job output is
>>> truncated at the point of pbs_server failover.
>>> I expected these rerunable jobs to be requeued when the failover
>>> occurred.  When do jobs get rerun?  Or am I misunderstanding what
>>> rerunable jobs are?
>>> Reference:
>>> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
>>> Thank you,
>>> -- 
>>> Victor Gregorio
>>> Penguin Computing
>>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>>>> Hello all,
>>>> I have a Torque failover configuration using CentOS' heartbeat
>>>> services.
>>>> When the failover from the primary to the secondary pbs_server system
>>>> happens, all running jobs marked rerunable try to restart on the
>>>> secondary system.
>>>> The problem is that the diskless compute nodes get re-provisioned,
>>>> destroying all pbs_mom information about the running jobs.  So, when
>>>> the rerunable jobs try to run, they are aborted with "Job does not
>>>> exist on node".
>>>> Is there a way for the pbs_server to resubmit a rerunable job using
>>>> data that is kept on the pbs_server?  I have tried tweaking the -t
>>>> option to 'hot' as well as changing mom_job_sync to False, but no luck.
>>>> Any advice is appreciated.
>>>> -- 
>>>> Victor Gregorio
>>>> Penguin Computing
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers