[torqueusers] Rerunable jobs for diskless nodes?

Victor Gregorio vgregorio at penguincomputing.com
Fri Apr 17 20:47:30 MDT 2009


Oops!

:%s/exists/exits/g 

-- 
Victor Gregorio
Penguin Computing

On Fri, Apr 17, 2009 at 02:40:47PM -0700, Victor Gregorio wrote:
> Hey folks, 
> 
> I think there might be a Torque bug regarding rerunable jobs starting in
> version 2.1.10.  We tested various Torque releases using the following
> scenario:
> 
> 1) Submit a rerunable job that uses all nodes
> 2) Reboot all pbs_mom nodes after job is in Running state
> 3) Wait for nodes to reboot and pbs_moms to start
> 4) See what happens to the rerunable job
> 
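> For reference, the rerunable job we submitted was along these lines
> (node count and sleep length are illustrative):
>
>     #!/bin/bash
>     #PBS -r y
>     #PBS -l nodes=4
>     # run long enough to reboot the moms while the job is Running
>     sleep 3600
>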
> Rerunable jobs only restarted in version 2.1.9.  Here are the details:
> 
> Version          Results
> =======          =======
> 2.1.9            Job restarts as expected
> 2.1.10           Job does not restart, it hangs in the Exiting state
> 2.1.11           Job does not restart, it exits and completes
> 2.3.0            Job does not restart, it hangs in the Exiting state
> 2.3.6            Job does not restart, it exits and completes
> 2.3.7            Job does not restart, it exits and completes
> 2.4.1b1          Job does not restart, it exits and completes
> 
> Please advise.
> 
> -- 
> Victor Gregorio
> Penguin Computing
> 
> On Fri, Apr 17, 2009 at 11:21:31AM -0700, Victor Gregorio wrote:
> > Unfortunately, my running jobs submitted with #PBS -r y are not
> > restarting when a pbs_server failover happens and the whole cluster
> > reboots.  
> > 
> > To limit the scope of my testing, I removed the pbs_server failover from
> > the scenario and just rebooted the compute nodes in the middle of
> > executing a rerunable job.
> > 
> > The results were the same: Once the pbs_moms reconnect to the
> > pbs_server, any previously running, rerunable jobs Exit prematurely and
> > are marked Completed instead of restarting.
> > 
> > Am I missing a configuration tweak?  Or am I misunderstanding how
> > rerunable jobs behave?  Any advice is appreciated.
> > 
> > -- 
> > Victor Gregorio
> > Penguin Computing
> > 
> > On Fri, Apr 17, 2009 at 01:18:12PM -0400, Steve Young wrote:
> > > comments below....
> > >
> > > On Apr 17, 2009, at 12:55 PM, Victor Gregorio wrote:
> > >
> > >> Hey Steve,
> > >>
> > >> Thanks for the reply. My comments are inline...
> > >>
> > >> On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
> > >>> Hi,
> > >>> 	I'm not using HA on our grid, but I thought I would add that I can
> > >>> restart torque/maui on the server and it doesn't affect the running
> > >>> jobs on the nodes. I would like to think HA would operate in the
> > >>> same fashion.
> > >>
> > >> In my configuration, the compute nodes reboot during the pbs_server
> > >> failover.  Since the mom_priv folder is on persistent disk storage, I
> > >> expected rerunable jobs in the Running state to be requeued.
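> > >>
> > >> For reference, we keep mom_priv persistent with something like a
> > >> bind mount (paths are illustrative; they depend on your PBS_HOME):
> > >>
> > >>     # /etc/fstab on each compute node
> > >>     /persist/torque/mom_priv  /var/spool/torque/mom_priv  none  bind  0 0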
> > >>
> > >
> > > Yeah, if the pbs_moms restart, your job is done... I'd expect that
> > > they *should* get re-run, and in fact I believe our grid works like
> > > this (but I'm not certain; I really haven't fully tested it, and I'm
> > > in need of a torque upgrade soon anyhow). I do know that during a
> > > recent power outage, with a bad UPS, some nodes were rebooted and
> > > those jobs did restart. But the troubling thing is I'm not sure why,
> > > since the user didn't specify this in their batch script. I know
> > > another user's jobs did not re-start, too.
> > >
> > >
> > >>> 	As for re-runnable jobs, here's my take on it. A job that can be
> > >>> re-run is one that can be started over from the beginning, meaning
> > >>> the program you're using can't do checkpointing, so if it failed
> > >>> part way through the only thing you can do is start it over. At
> > >>> least by making the job re-runnable you don't have to manually
> > >>> re-queue the job and lose your place in the queue.
> > >>
> > >> This is what I expected as well, but have not been able to create a
> > >> scenario where a job restarts from the beginning.  Should running,
> > >> rerunable jobs restart if compute nodes are rebooted during a
> > >> pbs_server failover?
> > >>
> > >
> > >
> > > I would think yes, especially if you declare #PBS -r y.
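> > >
> > > You can sanity-check the flag on a queued or running job with qstat,
> > > e.g.:
> > >
> > >     qstat -f <jobid> | grep -i rerun
> > >     # expect a line like: Rerunable = True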
> > >
> > >
> > >
> > >> Regards,
> > >>
> > >> -- 
> > >> Victor Gregorio
> > >> Penguin Computing
> > >>
> > >>> 	However, some types of jobs can be set up to pick up where they
> > >>> left off if the job failed for some reason. This is not re-runnable,
> > >>> since usually the user needs to change some of the files around for
> > >>> the job to make sure it picks up where it left off running from
> > >>> before. You wouldn't want this job to re-run, since it would start
> > >>> over from the beginning and you'd lose all the computations that had
> > >>> already been done up to that point.
> > >>>
> > >>> 	I could be wrong too, since some programs might be smart enough to
> > >>> know how to pick up where they left off when re-run, without user
> > >>> intervention. In this case it might be a re-runnable job, since it
> > >>> actually can continue on from where it last left off. I hope this
> > >>> helps,
> > >>>
> > >>> -Steve
> > >>>
> > >>>
> > >>> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
> > >>>
> > >>>> Hey folks,
> > >>>>
> > >>>> I configured the compute nodes to use persistent disk storage for
> > >>>> the mom_priv folder.
> > >>>>
> > >>>> Now, when the failover from primary to secondary pbs_server occurs,
> > >>>> running jobs marked rerunable do not abort.  As hoped, they show up
> > >>>> in the Running state, then Exit and Complete.  But when you look at
> > >>>> the output, the job does not complete its task.  The job output is
> > >>>> truncated at the point of pbs_server failover.
> > >>>>
> > >>>> I expected these rerunable jobs to be requeued when the failover
> > >>>> occurred.  When do jobs get rerun?  Or am I misunderstanding what
> > >>>> rerunable jobs are?
> > >>>>
> > >>>> Reference:
> > >>>> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
> > >>>>
> > >>>> Thank you,
> > >>>>
> > >>>> -- 
> > >>>> Victor Gregorio
> > >>>> Penguin Computing
> > >>>>
> > >>>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
> > >>>>> Hello all,
> > >>>>>
> > >>>>> I have a Torque failover configuration using CentOS' heartbeat
> > >>>>> services.  When the failover from the primary to the secondary
> > >>>>> pbs_server system happens, all running jobs marked rerunable try
> > >>>>> to restart on the secondary system.
> > >>>>>
> > >>>>> The problem is that the diskless compute nodes get re-provisioned,
> > >>>>> destroying all pbs_mom information about the running jobs.  So,
> > >>>>> when the rerunable jobs try to run, they are aborted with "Job
> > >>>>> does not exist on node".
> > >>>>>
> > >>>>> Is there a way for the pbs_server to resubmit a rerunable job
> > >>>>> using data that is kept on the pbs_server?  I have tried tweaking
> > >>>>> the -t option to 'hot' as well as changing mom_job_sync to False
> > >>>>> (commands below), but no luck.  Any advice is appreciated.
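> > >>>>>
> > >>>>> Those tweaks were along these lines (a sketch from memory; check
> > >>>>> the exact syntax on your install):
> > >>>>>
> > >>>>>     # restart pbs_server with the 'hot' start type, which attempts
> > >>>>>     # to restart jobs that were in the Running state
> > >>>>>     pbs_server -t hot
> > >>>>>
> > >>>>>     # disable the server's job-state sync/cleanup when moms reconnect
> > >>>>>     qmgr -c "set server mom_job_sync = False"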
> > >>>>>
> > >>>>> -- 
> > >>>>> Victor Gregorio
> > >>>>> Penguin Computing
> > >>>>>

