[torqueusers] Rerunable jobs for diskless nodes?

Victor Gregorio vgregorio at penguincomputing.com
Fri Apr 17 10:55:06 MDT 2009


Hey Steve,

Thanks for the reply. My comments are inline... 

On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
> Hi,
> 	I'm not using HA on our grid, but I thought I would add that I can
> restart torque/maui on the server and it doesn't affect the running jobs
> on the nodes. I would like to think HA would operate in the same fashion.

In my configuration, the compute nodes reboot during the pbs_server
failover.  Since the mom_priv folder is on persistent disk storage, I
expected rerunable jobs in the Running state to be requeued.

> 	As for re-runnable jobs, here's my take on it. A job that can be re-run is
> one that can be started over from the beginning. Meaning the program you're
> using can't do checkpointing, so if it failed part way through the only
> thing you can do is start it over. At least by making this job re-runnable
> you don't have to manually re-queue the job and lose your place in the
> queue.

This is what I expected as well, but have not been able to create a
scenario where a job restarts from the beginning.  Should running,
rerunable jobs restart if compute nodes are rebooted during a pbs_server
failover?
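
For reference, here is a minimal sketch of how a job could be marked
rerunable and requeued by hand, assuming the standard Torque client
commands qsub (with -r y) and qrerun.  The helper names, the script name
job.sh, and the job ID 123.headnode are hypothetical, and the sketch only
builds the command lines; it does not reflect my actual configuration.

```python
import shlex


def qsub_command(script, rerunable=True):
    """Build a qsub command line; -r y|n declares whether the job is rerunable."""
    return ["qsub", "-r", "y" if rerunable else "n", script]


def qrerun_command(job_id, force=False):
    """Build a qrerun command line to requeue a running job by hand."""
    cmd = ["qrerun"]
    if force:
        # -f asks qrerun to force the requeue.
        cmd.append("-f")
    cmd.append(job_id)
    return cmd


if __name__ == "__main__":
    # Hypothetical script name and job ID, for illustration only.
    print(shlex.join(qsub_command("job.sh")))
    print(shlex.join(qrerun_command("123.headnode")))
```

Requeuing a test job with qrerun while the cluster is healthy would at
least confirm that it behaves as rerunable outside of the failover case.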

Regards,

-- 
Victor Gregorio
Penguin Computing

> 	However, some types of jobs can be set up to pick up where they left off
> if the job failed for some reason. This is not re-runnable, since usually
> the user needs to change some of the files around to make sure the job
> picks up where it left off. You wouldn't want this job to re-run, since it
> would start over from the beginning and you'd lose all the computations
> that had already been done up to that point.
>
> 	I could be wrong too, since some programs might be smart enough to know
> how to pick up where they left off when re-run, without user intervention.
> In this case it might be a re-runnable job, since it actually can continue
> on from where it last left off. I hope this helps,
>
> -Steve
>
>
> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
>
>> Hey folks,
>>
>> I configured the compute nodes to use persistent disk storage for the
>> mom_priv folder.
>>
>> Now, when the failover from primary to secondary pbs_server occurs,
>> running jobs marked rerunable do not abort.  As hoped, they show up in
>> the Running state, then Exit and Complete.  But, when you look at the
>> output, the job does not complete its task.  The job output is
>> truncated at the point of pbs_server failover.
>>
>> I expected these rerunable jobs to be requeued when the failover
>> occurred.  When do jobs get rerun?  Or am I misunderstanding what
>> rerunable jobs are?
>>
>> Reference:
>> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
>>
>> Thank you,
>>
>> -- 
>> Victor Gregorio
>> Penguin Computing
>>
>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>>> Hello all,
>>>
>>> I have a Torque failover configuration using CentOS' heartbeat services.
>>> When the failover from the primary to the secondary pbs_server system
>>> happens, all running jobs marked rerunable try to restart on the
>>> secondary system.
>>>
>>> The problem is that the diskless compute nodes get re-provisioned,
>>> destroying all pbs_mom information about the running jobs.  So, when
>>> the rerunable jobs try to run, they are aborted with "Job does not
>>> exist on node".
>>>
>>> Is there a way for the pbs_server to resubmit a rerunable job using
>>> data that is kept on the pbs_server?  I have tried tweaking the -t
>>> option to 'hot' as well as changing mom_job_sync to False, but no luck.
>>> Any advice is appreciated.
>>>
>>> -- 
>>> Victor Gregorio
>>> Penguin Computing
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>

