[torqueusers] Rerunable jobs for diskless nodes?

Victor Gregorio vgregorio at penguincomputing.com
Fri Apr 17 12:21:31 MDT 2009


Unfortunately, my running jobs submitted with #PBS -r y are not
restarting when a pbs_server failover happens and the whole cluster
reboots.  

To limit the scope of my testing, I removed the pbs_server failover from
the scenario and just rebooted the compute nodes in the middle of
executing a rerunable job.
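
A test job along the following lines would make a rerun easy to spot. It is a sketch only: the job name, the log file name, and the STEPS and STEP_SECONDS knobs are illustrative assumptions, not taken from the actual test. It assumes the submission directory is on storage that survives a node reboot.

```shell
#!/bin/sh
#PBS -N rerun-test
#PBS -r y
# Hypothetical test job: every start appends a "start" line to a log in the
# submission directory, so a rerun after a node reboot shows up as a second
# "start" line for the same job id.
# STEPS and STEP_SECONDS are illustrative knobs, not Torque settings.
STEPS="${STEPS:-5}"
STEP_SECONDS="${STEP_SECONDS:-0}"   # raise (e.g. to 60) to leave time for a reboot
LOG="${PBS_O_WORKDIR:-$PWD}/rerun-test.log"

echo "start job=${PBS_JOBID:-local} host=$(uname -n)" >> "$LOG"
i=1
while [ "$i" -le "$STEPS" ]; do
    echo "step $i" >> "$LOG"
    sleep "$STEP_SECONDS"
    i=$((i + 1))
done
echo "done job=${PBS_JOBID:-local}" >> "$LOG"
```

If the job were truly rerun, the log would show two "start" lines and exactly one "done" line; a single "start" with no "done" matches the truncated output described in this thread.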

The results were the same: Once the pbs_moms reconnect to the
pbs_server, any previously running, rerunable jobs Exit prematurely and
are marked Completed instead of restarting.

Am I missing a configuration tweak? Maybe I am misunderstanding how
rerunable jobs behave?  Any advice is appreciated.

-- 
Victor Gregorio
Penguin Computing

On Fri, Apr 17, 2009 at 01:18:12PM -0400, Steve Young wrote:
> comments below....
>
> On Apr 17, 2009, at 12:55 PM, Victor Gregorio wrote:
>
>> Hey Steve,
>>
>> Thanks for the reply. My comments are inline...
>>
>> On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
>>> Hi,
>>> 	I'm not using HA on our grid but I thought I would add that I can
>>> restart torque/maui on the server and it doesn't affect the running
>>> jobs on the nodes. I would like to think HA would operate in the same
>>> fashion.
>>
>> In my configuration, the compute nodes reboot during the pbs_server
>> failover.  Since the mom_priv folder is on persistent disk storage, I
>> expected rerunable jobs in the Running state to be requeued.
>>
>
> Yeah, if the pbs_moms restart, your job is done... I'd expect that they
> *should* get re-run, and in fact I believe our grid works like this (but
> I'm not certain; I haven't fully tested it, and I'm in need of a
> torque upgrade soon anyhow). I do know that after a recent power outage
> and a bad UPS, some nodes were rebooted and those jobs did restart. The
> troubling thing is that I'm not sure why, since the user didn't
> specify this in their batch script. I know another user's jobs did not
> restart, too.
>
>
>>> 	As for re-runnable jobs, here's my take on it. A job that can be
>>> re-run is one that can be started over from the beginning. Meaning the
>>> program you're using can't do checkpointing, so if it failed part way
>>> through, the only thing you can do is start it over. At least by making
>>> this job re-runnable you don't have to manually re-queue the job and
>>> lose your place in the queue.
>>
>> This is what I expected as well, but have not been able to create a
>> scenario where a job restarts from the beginning.  Should running,
>> rerunable jobs restart if compute nodes are rebooted during a
>> pbs_server failover?
>>
>
>
> I would think yes especially if you declare #PBS -r y.
>
>
>
>> Regards,
>>
>> -- 
>> Victor Gregorio
>> Penguin Computing
>>
>>> 	However, some types of jobs can be set up to pick up where they left
>>> off if the job failed for some reason. This is not re-runnable, since
>>> usually the user needs to change some of the files around for the job
>>> to make sure it picks up where it left off. You wouldn't want this job
>>> to re-run, since it would start over from the beginning and you'd lose
>>> all the computations that had already been done up to that point.
>>>
>>> 	I could be wrong too, since some programs might be smart enough to
>>> know how to pick up where they left off when re-run, without user
>>> intervention. In this case it might be a re-runnable job, since it
>>> actually can continue on from where it last left off. I hope this helps,
>>>
>>> -Steve
>>>
>>>
>>> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I configured the compute nodes to use persistent disk storage for the
>>>> mom_priv folder.
>>>>
>>>> Now, when the failover from primary to secondary pbs_server occurs,
>>>> running jobs marked rerunable do not abort.  As hoped, they show up in
>>>> the Running state, then Exit and Complete.  But, when you look at the
>>>> output, the job does not complete its task.  The job output is
>>>> truncated at the point of pbs_server failover.
>>>>
>>>> I expected these rerunable jobs to be requeued when the failover
>>>> occurred.  When do jobs get rerun?  Or am I misunderstanding what
>>>> rerunable jobs are?
>>>>
>>>> Reference:
>>>> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
>>>>
>>>> Thank you,
>>>>
>>>> -- 
>>>> Victor Gregorio
>>>> Penguin Computing
>>>>
>>>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>>>>> Hello all,
>>>>>
>>>>> I have a Torque failover configuration using CentOS' heartbeat
>>>>> services.  When the failover from the primary to the secondary
>>>>> pbs_server system happens, all running jobs marked rerunable try to
>>>>> restart on the secondary system.
>>>>>
>>>>> The problem is that the diskless compute nodes get re-provisioned,
>>>>> destroying all pbs_mom information about the running jobs.  So, when
>>>>> the rerunable jobs try to run, they are aborted with "Job does not
>>>>> exist on node".
>>>>>
>>>>> Is there a way for the pbs_server to resubmit a rerunable job using
>>>>> data that is kept on the pbs_server?  I have tried tweaking the -t
>>>>> option to 'hot' as well as changing mom_job_sync to False, but no
>>>>> luck.  Any advice is appreciated.
>>>>>
>>>>> -- 
>>>>> Victor Gregorio
>>>>> Penguin Computing
>>>>>
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>
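
Steve's distinction above, between a job that starts over from the beginning and one that picks up where it left off, can be sketched as a job script that records its own progress. This is only an illustration under assumptions: the state file name, the TOTAL knob, and the placeholder "work" are hypothetical, and the submission directory is assumed to be on storage that survives a node reboot.

```shell
#!/bin/sh
#PBS -r y
# Hypothetical resumable job: the last completed step is kept in a state
# file, so a rerun continues after that step instead of starting at step 1.
STATE="${PBS_O_WORKDIR:-$PWD}/progress.state"
TOTAL="${TOTAL:-5}"          # illustrative number of work steps

last=0
if [ -f "$STATE" ]; then
    last=$(cat "$STATE")     # step number recorded by a previous run
fi

i=$((last + 1))
while [ "$i" -le "$TOTAL" ]; do
    echo "working on step $i"            # placeholder for the real work
    # Write to a temporary file and rename it, so an interruption mid-write
    # does not leave a half-written state file behind.
    echo "$i" > "$STATE.tmp" && mv "$STATE.tmp" "$STATE"
    i=$((i + 1))
done
```

With a script shaped like this, marking the job rerunable does not throw away finished steps, because a rerun skips whatever the state file says is already done.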

