[torqueusers] Rerunable jobs for diskless nodes?

Victor Gregorio vgregorio at penguincomputing.com
Mon Apr 20 09:57:45 MDT 2009


Hey Garrick,

Using Torque 2.1.9, I can get job restarts to reliably work after
execution node reboots.  Starting with 2.1.10, this functionality
breaks.

Should I bring this up with the torque-devel list?

-- 
Victor Gregorio
Penguin Computing

On Fri, Apr 17, 2009 at 07:49:35PM -0700, Garrick wrote:
> I don't think job restarts or exits have ever been reliable when the  
> execution node's mom restarts or reboots.
>
> HPCC/Linux Systems Admin
>
> On Apr 17, 2009, at 7:47 PM, Victor Gregorio 
> <vgregorio at penguincomputing.com> wrote:
>
>> Oops!
>>
>> :%s/exists/exits/g
>>
>> -- 
>> Victor Gregorio
>> Penguin Computing
>>
>> On Fri, Apr 17, 2009 at 02:40:47PM -0700, Victor Gregorio wrote:
>>> Hey folks,
>>>
>>> I think there might be a Torque bug regarding rerunable jobs,
>>> starting in version 2.1.10.  We tested various Torque releases
>>> using the following scenario:
>>>
>>> 1) Submit a rerunable job that uses all nodes
>>> 2) Reboot all pbs_mom nodes after job is in Running state
>>> 3) Wait for nodes to reboot and pbs_moms to start
>>> 4) See what happens to the rerunable job
>>>
>>> Rerunable jobs only restarted in version 2.1.9.  Here are the
>>> details:
>>>
>>> Version          Results
>>> =======          =======
>>> 2.1.9            Job restarts as expected
>>> 2.1.10           Job does not restart; it hangs in the Exiting state
>>> 2.1.11           Job does not restart; it exits and completes
>>> 2.3.0            Job does not restart; it hangs in the Exiting state
>>> 2.3.6            Job does not restart; it exits and completes
>>> 2.3.7            Job does not restart; it exits and completes
>>> 2.4.1b1          Job does not restart; it exits and completes
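For anyone wanting to retry this, the four-step scenario above can be sketched as a shell session. All names here (job script, node list, resource request) are illustrative, not from the original report, and a working Torque cluster with root ssh access to the nodes is assumed:

```shell
#!/bin/sh
# Sketch of the four-step test scenario; names are illustrative.

# 1) Submit a rerunable job that uses all nodes.
cat > rerun_test.sh <<'EOF'
#!/bin/sh
#PBS -r y
#PBS -l nodes=4
sleep 3600
EOF
JOBID=$(qsub rerun_test.sh)

# 2) Once the job shows state R, reboot all pbs_mom nodes.
qstat "$JOBID"
for node in node01 node02 node03 node04; do
    ssh "$node" reboot &
done
wait

# 3) Wait for the nodes to come back and their pbs_moms to register.
pbsnodes -a

# 4) See what happened to the job: requeued (Q/R) or dead (E/C)?
qstat -f "$JOBID"
```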
>>>
>>> Please advise.
>>>
>>> -- 
>>> Victor Gregorio
>>> Penguin Computing
>>>
>>> On Fri, Apr 17, 2009 at 11:21:31AM -0700, Victor Gregorio wrote:
>>>> Unfortunately, my running jobs submitted with #PBS -r y are not
>>>> restarting when a pbs_server failover happens and the whole cluster
>>>> reboots.
>>>>
>>>> To limit the scope of my testing, I removed the pbs_server failover 
>>>> from
>>>> the scenario and just rebooted the compute nodes in the middle of
>>>> executing a rerunable job.
>>>>
>>>> The results were the same: Once the pbs_moms reconnect to the
>>>> pbs_server, any previously running, rerunable jobs Exit prematurely 
>>>> and
>>>> are marked Completed instead of restarting.
>>>>
>>>> Am I missing a configuration tweak? Maybe I am misunderstanding how
>>>> rerunable jobs behave?  Any advice is appreciated.
>>>>
>>>> -- 
>>>> Victor Gregorio
>>>> Penguin Computing
>>>>
>>>> On Fri, Apr 17, 2009 at 01:18:12PM -0400, Steve Young wrote:
>>>>> comments below....
>>>>>
>>>>> On Apr 17, 2009, at 12:55 PM, Victor Gregorio wrote:
>>>>>
>>>>>> Hey Steve,
>>>>>>
>>>>>> Thanks for the reply. My comments are inline...
>>>>>>
>>>>>> On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
>>>>>>> Hi,
>>>>>>>    I'm not using HA on our grid, but I thought I would add that
>>>>>>> I can restart torque/maui on the server and it doesn't affect
>>>>>>> the running jobs on the nodes. I would like to think HA would
>>>>>>> operate in the same fashion.
>>>>>>
>>>>>> In my configuration, the compute nodes reboot during the
>>>>>> pbs_server failover.  Since the mom_priv folder is on persistent
>>>>>> disk storage, I expected rerunable jobs in the Running state to
>>>>>> be requeued.
>>>>>>
>>>>>
>>>>> Yeah, if the pbs_moms restart, your job is done... I'd expect
>>>>> that they *should* get re-run, and in fact I believe our grid
>>>>> works like this (but I'm not certain; I haven't fully tested it,
>>>>> and I'm in need of a Torque upgrade soon anyhow). I do know that
>>>>> during a recent power outage with a bad UPS, some nodes were
>>>>> rebooted and those jobs did restart. The troubling thing is that
>>>>> I'm not sure why, since the user didn't specify this in their
>>>>> batch script. I know another user's jobs did not restart, either.
>>>>>
>>>>>
>>>>>>>    As for re-runnable jobs, here's my take on it. A job that
>>>>>>> can be re-run is one that can be started over from the
>>>>>>> beginning, meaning the program you're using can't do
>>>>>>> checkpointing, so if it failed part way through, the only thing
>>>>>>> you can do is start it over. At least by making the job
>>>>>>> re-runnable you don't have to manually re-queue it and lose
>>>>>>> your place in the queue.
>>>>>>
>>>>>> This is what I expected as well, but I have not been able to
>>>>>> create a scenario where a job restarts from the beginning.
>>>>>> Should running, rerunable jobs restart if compute nodes are
>>>>>> rebooted during a pbs_server failover?
>>>>>>
>>>>>
>>>>>
>>>>> I would think yes, especially if you declare #PBS -r y.
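For reference, a minimal job script carrying that directive might look like this (the resource request and job name are illustrative, not from the thread):

```shell
#!/bin/sh
# Minimal rerunable PBS job script; resource request and name are
# illustrative.
#PBS -r y                          # mark the job rerunable
#PBS -l nodes=2,walltime=01:00:00
#PBS -N rerun-test

echo "started on $(hostname) at $(date)"
sleep 600
```

The same flag can also be given at submit time with `qsub -r y script.sh`, and `qstat -f <jobid>` shows it as the job's Rerunable attribute.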
>>>>>
>>>>>
>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> -- 
>>>>>> Victor Gregorio
>>>>>> Penguin Computing
>>>>>>
>>>>>>>    However, some types of jobs can be set up to pick up where
>>>>>>> they left off if the job failed for some reason. This is not
>>>>>>> re-runnable, since usually the user needs to change some of the
>>>>>>> files around for the job to make sure it picks up where it left
>>>>>>> off.  You wouldn't want this job to re-run, since it would
>>>>>>> start over from the beginning and you'd lose all the
>>>>>>> computations that had already been done up to that point.
>>>>>>>
>>>>>>>    I could be wrong, too, since some programs might be smart
>>>>>>> enough to know how to pick up where they left off when re-run,
>>>>>>> without user intervention. In that case it might be a
>>>>>>> re-runnable job, since it actually can continue on from where
>>>>>>> it last left off. I hope this helps,
>>>>>>>
>>>>>>> -Steve
>>>>>>>
>>>>>>>
>>>>>>> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I configured the compute nodes to use persistent disk storage
>>>>>>>> for the mom_priv folder.
>>>>>>>>
>>>>>>>> Now, when the failover from primary to secondary pbs_server
>>>>>>>> occurs, running jobs marked rerunable do not abort.  As hoped,
>>>>>>>> they show up in the Running state, then Exit and Complete.
>>>>>>>> But when you look at the output, the job does not complete its
>>>>>>>> task.  The job output is truncated at the point of pbs_server
>>>>>>>> failover.
>>>>>>>>
>>>>>>>> I expected these rerunable jobs to be requeued when the failover
>>>>>>>> occurred.  When do jobs get rerun?  Or am I misunderstanding 
>>>>>>>> what
>>>>>>>> rerunable jobs are?
>>>>>>>>
>>>>>>>> Reference:
>>>>>>>> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Victor Gregorio
>>>>>>>> Penguin Computing
>>>>>>>>
>>>>>>>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I have a Torque failover configuration using CentOS' heartbeat
>>>>>>>>> services.
>>>>>>>>> When the failover from the primary to the secondary pbs_server
>>>>>>>>> system
>>>>>>>>> happens, all running jobs marked rerunable try to restart 
>>>>>>>>> on the
>>>>>>>>> secondary system.
>>>>>>>>>
>>>>>>>>> The problem is that the diskless compute nodes get
>>>>>>>>> re-provisioned, destroying all pbs_mom information about the
>>>>>>>>> running jobs.  So, when the rerunable jobs try to run, they
>>>>>>>>> are aborted with "Job does not exist on node".
>>>>>>>>>
>>>>>>>>> Is there a way for the pbs_server to resubmit a rerunable job
>>>>>>>>> using data that is kept on the pbs_server?  I have tried
>>>>>>>>> tweaking the -t option to 'hot' as well as changing
>>>>>>>>> mom_job_sync to False, but no luck.  Any advice is
>>>>>>>>> appreciated.
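For reference, the two settings mentioned here are applied roughly as follows on the pbs_server host. This is a sketch of standard Torque admin commands, not commands from the original thread, and exact behavior varies by version:

```shell
# Sketch: applying the two settings discussed above (run on the
# pbs_server host).

# Stop the server cleanly, then restart it with the 'hot' start type,
# which attempts to put previously running jobs back into execution:
qterm -t quick
pbs_server -t hot

# Turn off the server-to-mom job synchronization check:
qmgr -c "set server mom_job_sync = False"
qmgr -c "print server"    # verify the current settings
```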
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Victor Gregorio
>>>>>>>>> Penguin Computing
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> torqueusers mailing list
>>>>>>>>> torqueusers at supercluster.org
>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>
>>>>>

