[torqueusers] Rerunable jobs for diskless nodes?

Steve Young chemadm at hamilton.edu
Fri Apr 17 07:54:23 MDT 2009


Hi,
	I'm not using HA on our grid, but I thought I would add that I can  
restart torque/maui on the server and it doesn't affect the running  
jobs on the nodes. I would like to think HA would operate in the same  
fashion.

	As for re-runnable jobs, here's my take on it. A job that can be re- 
run is one that can be started over from the beginning. That means the  
program you're using can't do checkpointing, so if it fails part way  
through, the only thing you can do is start it over. At least by making  
the job re-runnable you don't have to manually re-queue it and lose  
your place in the queue.
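For reference, this is controlled by qsub's -r flag in TORQUE; a
minimal sketch (the script name is just an example):

```shell
# Mark a job as re-runnable at submission time:
qsub -r y myjob.sh

# Or declare it inside the job script itself:
#PBS -r y
```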

	However, some types of jobs can be set up to pick up where they left  
off if the job failed for some reason. These are not re-runnable,  
since usually the user needs to change some files around to make sure  
the job resumes from where it left off. You wouldn't want such a job  
to re-run, since it would start over from the beginning and you'd lose  
all the computation that had already been done up to that point.
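That pick-up-where-you-left-off pattern can be sketched in a few lines
of shell (the "checkpoint" file name and the 10-step workload are made
up for illustration):

```shell
#!/bin/sh
# Sketch of a job that resumes from a checkpoint file instead of
# restarting from scratch.  "checkpoint" and the 10-step loop are
# hypothetical stand-ins for a real application's state and work.
CKPT=checkpoint
last=0
[ -f "$CKPT" ] && last=$(cat "$CKPT")   # resume after the last finished step
i=$((last + 1))
while [ "$i" -le 10 ]; do
    echo "step $i"                      # stand-in for real computation
    echo "$i" > "$CKPT"                 # record progress after each step
    i=$((i + 1))
done
```

If this script is simply re-run from the top after a failure, it skips
the steps already recorded in the checkpoint file.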

	I could be wrong, too, since some programs might be smart enough to  
know how to pick up where they left off when re-run, without user  
intervention. In that case it might be a re-runnable job, since it  
actually can continue from where it last left off. I hope this helps,

-Steve


On Apr 16, 2009, at 8:39 PM, Victor Gregorio wrote:

> Hey folks,
>
> I configured the compute nodes to use persistent, disk storage for the
> mom_priv folder.
>
> Now, when the failover from primary to secondary pbs_server occurs,
> running jobs marked rerunable do not abort.  As hoped, they show up in
> the Running state, then Exit and Complete.  But when you look at the
> output, the job does not complete its task.  The job output is
> truncated at the point of pbs_server failover.
>
> I expected these rerunable jobs to be requeued when the failover
> occurred.  When do jobs get rerun?  Or am I misunderstanding what
> rerunable jobs are?
>
> Reference:
> http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
>
> Thank you,
>
> -- 
> Victor Gregorio
> Penguin Computing
>
> On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
>> Hello all,
>>
>> I have a Torque failover configuration using CentOS' heartbeat  
>> services.
>> When the failover from the primary to the secondary pbs_server system
>> happens, all running jobs marked rerunable try to restart on the
>> secondary system.
>>
>> The problem is that the diskless compute nodes get re-provisioned,
>> destroying all pbs_mom information about the running jobs.  So, when
>> the rerunable jobs try to run, they are aborted with "Job does not
>> exist on node".
>>
>> Is there a way for the pbs_server to resubmit a rerunable job using
>> data that is kept on the pbs_server?  I have tried tweaking the -t
>> option to 'hot' as well as changing mom_job_sync to False, but no
>> luck.  Any advice is appreciated.
>>
>> -- 
>> Victor Gregorio
>> Penguin Computing
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
