[torqueusers] Rerunable jobs for diskless nodes?

Victor Gregorio vgregorio at penguincomputing.com
Fri Apr 17 15:40:47 MDT 2009


Hey folks, 

I think there might be a Torque bug affecting rerunable jobs,
introduced in version 2.1.10.  We tested various Torque releases using
the following scenario:

1) Submit a rerunable job that uses all nodes
2) Reboot all pbs_mom nodes after job is in Running state
3) Wait for nodes to reboot and pbs_moms to start
4) See what happens to the rerunable job
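
The scenario above can be sketched as a handful of commands (the node
names, script name, and use of ssh for the reboot are illustrative
placeholders, not our actual setup):

```shell
# 1) Submit a rerunable job that requests all nodes (here, 4 of them)
qsub -r y -l nodes=4 long_job.sh

# 2) Once qstat shows the job in the R (Running) state,
#    reboot every pbs_mom node
for n in node01 node02 node03 node04; do ssh "$n" reboot; done

# 3) Wait for the nodes to come back and pbs_mom to start, then
# 4) inspect the job's fate
qstat -f <jobid>
```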

Rerunable jobs only restarted in version 2.1.9.  Here are the details:

Version          Results
=======          =======
2.1.9            Job restarts as expected
2.1.10           Job does not restart, it hangs in exit state
2.1.11           Job does not restart, it exits and completes
2.3.0            Job does not restart, it hangs in exit state
2.3.6            Job does not restart, it exits and completes
2.3.7            Job does not restart, it exits and completes
2.4.1b1          Job does not restart, it exits and completes

Please advise.

-- 
Victor Gregorio
Penguin Computing

On Fri, Apr 17, 2009 at 11:21:31AM -0700, Victor Gregorio wrote:
> Unfortunately, my running jobs submitted with #PBS -r y are not
> restarting when a pbs_server failover happens and the whole cluster
> reboots.  
> 
> To limit the scope of my testing, I removed the pbs_server failover from
> the scenario and just rebooted the compute nodes in the middle of
> executing a rerunable job.
> 
> The results were the same: Once the pbs_moms reconnect to the
> pbs_server, any previously running, rerunable jobs Exit prematurely and
> are marked Completed instead of restarting.
> 
> Am I missing a configuration tweak? Maybe I am misunderstanding how
> rerunable jobs behave?  Any advice is appreciated.
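> 
> For what it's worth, a quick way to double-check the setup is
> something like this (the job id is a placeholder):
> 
> ```shell
> # confirm the job really is marked rerunable
> qstat -f <jobid> | grep -i rerun
> 
> # dump the pbs_server attributes (mom_job_sync and friends)
> qmgr -c 'print server'
> ```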
> 
> -- 
> Victor Gregorio
> Penguin Computing
> 
> On Fri, Apr 17, 2009 at 01:18:12PM -0400, Steve Young wrote:
> > comments below....
> >
> > On Apr 17, 2009, at 12:55 PM, Victor Gregorio wrote:
> >
> >> Hey Steve,
> >>
> >> Thanks for the reply. My comments are inline...
> >>
> >> On Fri, Apr 17, 2009 at 09:54:23AM -0400, Steve Young wrote:
> >>> Hi,
> >>> 	I'm not using HA on our grid, but I thought I would add that I can
> >>> restart torque/maui on the server and it doesn't affect the running
> >>> jobs on the nodes. I would like to think HA would operate in the same
> >>> fashion.
> >>
> >> In my configuration, the compute nodes reboot during the pbs_server
> >> failover.  Since the mom_priv folder is on persistent disk storage, I
> >> expected rerunable jobs in the Running state to be requeued.
> >>
> >
> > Yeah, if the pbs_moms restart, your job is done... I'd expect that they
> > *should* get re-run, and in fact I believe our grid works like this (but
> > I'm not certain; I really haven't fully tested it, and I'm in need of a
> > torque upgrade soon anyhow). I do know that during a recent power outage
> > with a bad UPS, some nodes were rebooted and those jobs did restart. But
> > the troubling thing I'm not sure of is why, since the user didn't
> > specify this in their batch script. I know another user's jobs did not
> > re-start, either.
> >
> >
> >>> 	As for re-runnable jobs, here's my take on it. A job that can be
> >>> re-run is one that can be started over from the beginning. Meaning the
> >>> program you're using can't do checkpointing, so if it failed part way
> >>> through, the only thing you can do is start it over. At least by
> >>> making this job re-runnable, you don't have to manually re-queue the
> >>> job and lose your place in the queue.
> >>
> >> This is what I expected as well, but have not been able to create a
> >> scenario where a job restarts from the beginning.  Should running,
> >> rerunable jobs restart if compute nodes are rebooted during a
> >> pbs_server failover?
> >>
> >
> >
> > I would think yes especially if you declare #PBS -r y.
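> >
> > For reference, a minimal rerunable job script would look something
> > like this (the resource request and workload are placeholders):
> >
> > ```shell
> > #!/bin/sh
> > #PBS -r y           # mark the job rerunable
> > #PBS -l nodes=4     # example resource request
> > sleep 3600          # stand-in for the real workload
> > ```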
> >
> >
> >
> >> Regards,
> >>
> >> -- 
> >> Victor Gregorio
> >> Penguin Computing
> >>
> >>> 	However, some types of jobs can be set up to pick up where they
> >>> left off if the job failed for some reason. This is not re-runnable,
> >>> since usually the user needs to change some of the files around for
> >>> the job to make sure it picks up where it left off running from
> >>> before. You wouldn't want this job to re-run, since it would start
> >>> over from the beginning and you'd lose all the computations that had
> >>> already been done up to that point.
> >>>
> >>> 	I could be wrong too, since some programs might be smart enough to
> >>> know how to pick up where it left off when re-run, without user
> >>> intervention. In this case it might be a re-runnable job, since it
> >>> actually can continue on from where it last left off. I hope this
> >>> helps,
> >>>
> >>> -Steve
> >>>
> >>>
> >>> On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:
> >>>
> >>>> Hey folks,
> >>>>
> >>>> I configured the compute nodes to use persistent disk storage for
> >>>> the mom_priv folder.
> >>>>
> >>>> Now, when the failover from primary to secondary pbs_server occurs,
> >>>> running jobs marked rerunable do not abort.  As hoped, they show up
> >>>> in the Running state, then Exit and Complete.  But, when you look at
> >>>> the output, the job does not complete its task.  The job output is
> >>>> truncated at the point of pbs_server failover.
> >>>>
> >>>> I expected these rerunable jobs to be requeued when the failover
> >>>> occurred.  When do jobs get rerun?  Or am I misunderstanding what
> >>>> rerunable jobs are?
> >>>>
> >>>> Reference:
> >>>> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
> >>>>
> >>>> Thank you,
> >>>>
> >>>> -- 
> >>>> Victor Gregorio
> >>>> Penguin Computing
> >>>>
> >>>> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
> >>>>> Hello all,
> >>>>>
> >>>>> I have a Torque failover configuration using CentOS' heartbeat
> >>>>> services.  When the failover from the primary to the secondary
> >>>>> pbs_server system happens, all running jobs marked rerunable try to
> >>>>> restart on the secondary system.
> >>>>>
> >>>>> The problem is that the diskless compute nodes get re-provisioned,
> >>>>> destroying all pbs_mom information about the running jobs.  So,
> >>>>> when the rerunable jobs try to run, they are aborted with "Job does
> >>>>> not exist on node".
> >>>>>
> >>>>> Is there a way for the pbs_server to resubmit a rerunable job using
> >>>>> data that is kept on the pbs_server?  I have tried tweaking the -t
> >>>>> option to 'hot' as well as changing mom_job_sync to False, but no
> >>>>> luck.  Any advice is appreciated.
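> >>>>>
> >>>>> For reference, the two tweaks I tried were along these lines:
> >>>>>
> >>>>> ```shell
> >>>>> # start pbs_server with a "hot" restart of running jobs
> >>>>> pbs_server -t hot
> >>>>>
> >>>>> # relax the mom/job consistency check
> >>>>> qmgr -c 'set server mom_job_sync = False'
> >>>>> ```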
> >>>>>
> >>>>> -- 
> >>>>> Victor Gregorio
> >>>>> Penguin Computing
> >>>>>
> >>>>> _______________________________________________
> >>>>> torqueusers mailing list
> >>>>> torqueusers at supercluster.org
> >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>>
> >

