[torqueusers] Rerunable jobs for diskless nodes?

Garrick garrick at usc.edu
Fri Apr 17 20:49:35 MDT 2009


I don't think job restarts or exits have ever been reliable when the
execution node's mom restarts or reboots.

HPCC/Linux Systems Admin

On Apr 17, 2009, at 7:47 PM, Victor Gregorio <vgregorio at penguincomputing.com 
 > wrote:

> Oops!
>
> :%s/exists/exits/g
>
> -- 
> Victor Gregorio
> Penguin Computing
>
> On Fri, Apr 17, 2009 at 02:40:47PM -0700, Victor Gregorio wrote:
>> Hey folks,
>>
>> I think there might be a Torque bug regarding rerunable jobs starting
>> in version 2.1.10.  We tested various Torque releases using the
>> following scenario:
>>
>> 1) Submit a rerunable job that uses all nodes
>> 2) Reboot all pbs_mom nodes after job is in Running state
>> 3) Wait for nodes to reboot and pbs_moms to start
>> 4) See what happens to the rerunable job
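For reference, a sketch of how that scenario could be driven from the command line (the node names, node count, and job script name are illustrative; the reboot mechanism will vary per site):

```shell
# 1) submit a rerunable job that spans all nodes (-r y marks it rerunable)
qsub -r y -l nodes=4 long_job.sh

# 2) once qstat shows the job in state R, reboot every pbs_mom node
for n in node01 node02 node03 node04; do ssh "$n" reboot; done

# 3) wait for the nodes and pbs_moms to come back, then
# 4) see what happened to the job
qstat -f
```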
>>
>> Rerunable jobs only restarted in version 2.1.9.  Here are the details:
>>
>> Version          Results
>> =======          =======
>> 2.1.9            Job restarts as expected
>> 2.1.10           Job does not restart, it hangs in exit state
>> 2.1.11           Job does not restart, it exits and completes
>> 2.3.0            Job does not restart, it hangs in exit state
>> 2.3.6            Job does not restart, it exits and completes
>> 2.3.7            Job does not restart, it exits and completes
>> 2.4.1b1          Job does not restart, it exits and completes
>>
>> Please advise.
>>
>> -- 
>> Victor Gregorio
>> Penguin Computing
>>
>> On Fri, Apr 17, 2009 at 11:21:31AM -0700, Victor Gregorio wrote:
>>> Unfortunately, my running jobs submitted with #PBS -r y are not
>>> restarting when a pbs_server failover happens and the whole cluster
>>> reboots.
>>>
>>> To limit the scope of my testing, I removed the pbs_server failover
>>> from the scenario and just rebooted the compute nodes in the middle of
>>> executing a rerunable job.
>>>
>>> The results were the same: Once the pbs_moms reconnect to the
>>> pbs_server, any previously running, rerunable jobs Exit prematurely and
>>> are marked Completed instead of restarting.
>>>
>>> Am I missing a configuration tweak? Maybe I am misunderstanding how
>>> rerunable jobs behave?  Any advice is appreciated.
>>>
>>> -- 
>>> Victor Gregorio
>>> Penguin Computing
>>>
>>> On Fri, Apr 17, 2009 at 01:18:12PM -0400, Steve Young wrote:
>>>> comments below....
>>>>
>>>> On Apr 17, 2009, at 12:55 PM, Victor Gregorio wrote:
>>>>
>>>>> Hey Steve,
>>>>>
>>>>> Thanks for the reply. My comments are inline...
>>>>>
>>>>> On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
>>>>>> Hi,
>>>>>>    I'm not using HA on our grid but I thought I would add that  
>>>>>> I can
>>>>>> restart torque/maui on the server and it doesn't effect the  
>>>>>> running
>>>>>> jobs
>>>>>> on the nodes. I would like to think HA would operate in the same
>>>>>> fashion.
>>>>>
>>>>> In my configuration, the compute nodes reboot during the pbs_server
>>>>> failover.  Since the mom_priv folder is on persistent disk storage,
>>>>> I expected rerunable jobs in the Running state to be requeued.
>>>>>
>>>>
>>>> Yeah, if the pbs_moms restart, your job is done... I'd expect that
>>>> they *should* get re-run, and in fact I believe our grid works like
>>>> this (but I'm not certain; I haven't fully tested it, and I'm in need
>>>> of a torque upgrade soon anyhow). I do know that during a recent
>>>> power outage with a bad UPS, some nodes were rebooted and those jobs
>>>> did restart. The troubling thing is I'm not sure why, since the user
>>>> didn't specify this in their batch script. I also know another user's
>>>> jobs did not re-start.
>>>>
>>>>
>>>>>>    As for re-runnable jobs, here's my take on it. A job that can be
>>>>>> re-run is one that can be started over from the beginning. That is,
>>>>>> the program you're using can't do checkpointing, so if it fails part
>>>>>> way through, the only thing you can do is start it over. At least by
>>>>>> making the job re-runnable you don't have to manually re-queue it
>>>>>> and lose your place in the queue.
>>>>>
>>>>> This is what I expected as well, but I have not been able to create
>>>>> a scenario where a job restarts from the beginning.  Should running,
>>>>> rerunable jobs restart if compute nodes are rebooted during a
>>>>> pbs_server failover?
>>>>>
>>>>
>>>>
>>>> I would think yes, especially if you declare #PBS -r y.
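For reference, the declaration is just a directive in the submission script; this fragment is a sketch (the resource request and program name are placeholders):

```shell
#!/bin/sh
#PBS -r y                          # mark the job rerunable
#PBS -l nodes=2,walltime=01:00:00
cd "$PBS_O_WORKDIR"
./my_long_computation              # placeholder; a rerun starts it from scratch
```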
>>>>
>>>>
>>>>
>>>>> Regards,
>>>>>
>>>>> -- 
>>>>> Victor Gregorio
>>>>> Penguin Computing
>>>>>
>>>>>>    However, some types of jobs can be set up to pick up where they
>>>>>> left off if the job failed for some reason. This is not re-runnable,
>>>>>> since usually the user needs to change some files around to make
>>>>>> sure the job picks up where it left off. You wouldn't want this job
>>>>>> to re-run, since it would start over from the beginning and you'd
>>>>>> lose all the computations that had already been done up to that
>>>>>> point.
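That distinction can be sketched in a few lines of shell: a plain rerunable job repeats its work from step one, while a job that writes a checkpoint file resumes where it stopped. The `ckpt.txt` name, the step count, and the simulated "reboot" are purely illustrative:

```shell
#!/bin/sh
# Toy contrast: a rerunable job with no checkpoint would redo steps 1-5
# from scratch; this version records progress in a checkpoint file and
# resumes from it on rerun.
CKPT=ckpt.txt

run_job() {
    # resume from the checkpoint if a previous run left one behind
    start=1
    [ -f "$CKPT" ] && start=$(cat "$CKPT")
    i=$start
    while [ "$i" -le 5 ]; do
        echo "step $i"
        echo $((i + 1)) > "$CKPT"     # record the next step to run
        # simulate a node reboot part way through the first run
        [ "$i" -eq 3 ] && [ "$1" = "interrupt" ] && return 0
        i=$((i + 1))
    done
    rm -f "$CKPT"                     # finished: clear the checkpoint
}

run_job interrupt    # "reboot" after step 3; checkpoint now holds 4
run_job              # the rerun resumes at step 4 instead of step 1
```

The point is that `-r y` only promises the restart-from-scratch behavior; picking up mid-stream is entirely up to the application.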
>>>>>>
>>>>>>    I could be wrong too, since some programs might be smart enough
>>>>>> to know how to pick up where they left off when re-run, without user
>>>>>> intervention. In that case the job might be re-runnable, since it
>>>>>> actually can continue on from where it last left off. I hope this
>>>>>> helps,
>>>>>>
>>>>>> -Steve
>>>>>>
>>>>>>
>>>>>> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> I configured the compute nodes to use persistent disk storage for
>>>>>>> the mom_priv folder.
>>>>>>>
>>>>>>> Now, when the failover from primary to secondary pbs_server
>>>>>>> occurs, running jobs marked rerunable do not abort.  As hoped, they
>>>>>>> show up in the Running state, then Exit and Complete.  But when you
>>>>>>> look at the output, the job does not complete its task.  The job
>>>>>>> output is truncated at the point of pbs_server failover.
>>>>>>>
>>>>>>> I expected these rerunable jobs to be requeued when the failover
>>>>>>> occurred.  When do jobs get rerun?  Or am I misunderstanding what
>>>>>>> rerunable jobs are?
>>>>>>>
>>>>>>> Reference:
>>>>>>> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> -- 
>>>>>>> Victor Gregorio
>>>>>>> Penguin Computing
>>>>>>>
>>>>>>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> I have a Torque failover configuration using CentOS' heartbeat
>>>>>>>> services.
>>>>>>>> When the failover from the primary to the secondary pbs_server
>>>>>>>> system
>>>>>>>> happens, all running jobs marked rerunable try to restart on  
>>>>>>>> the
>>>>>>>> secondary system.
>>>>>>>>
>>>>>>>> The problem is that the diskless compute nodes get re-provisioned,
>>>>>>>> destroying all pbs_mom information about the running jobs.  So,
>>>>>>>> when the rerunable jobs try to run, they are aborted with "Job
>>>>>>>> does not exist on node".
>>>>>>>>
>>>>>>>> Is there a way for the pbs_server to resubmit a rerunable job
>>>>>>>> using data that is kept on the pbs_server?  I have tried tweaking
>>>>>>>> the -t option to 'hot' as well as changing mom_job_sync to False,
>>>>>>>> but no luck.  Any advice is appreciated.
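For anyone reading along, the two tweaks mentioned look like this on the command line (standard pbs_server/qmgr syntax; whether they help in the diskless case is exactly the open question):

```shell
# restart the server with 'hot' so it tries to immediately rerun jobs
# that were running when it went down
pbs_server -t hot

# relax the server's insistence that the moms still know about the job
qmgr -c "set server mom_job_sync = False"
```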
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Victor Gregorio
>>>>>>>> Penguin Computing
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> torqueusers mailing list
>>>>>>>> torqueusers at supercluster.org
>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>
>>>>

