[torqueusers] Rerunable jobs for diskless nodes?

Victor Gregorio vgregorio at penguincomputing.com
Thu Apr 16 18:39:47 MDT 2009


Hey folks,

I configured the compute nodes to use persistent disk storage for the
mom_priv folder.

Now, when the failover from the primary to the secondary pbs_server
occurs, running jobs marked rerunable do not abort.  As hoped, they show
up in the Running state, then Exit and Complete.  But when you look at
the output, the job has not completed its task: the output is truncated
at the point of the pbs_server failover.
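
For anyone wanting to reproduce this, a test job along the following
lines makes the truncation easy to spot.  This is only a minimal sketch:
the function names, the step counter and the JOB_DONE marker are
illustrative assumptions, not part of Torque and not the actual job
used here.

```shell
#!/bin/sh
# Hypothetical test job body: print numbered steps, then a final marker.
# $1 = number of steps (default 5), $2 = seconds to sleep per step (default 0).
run_steps() {
    steps=${1:-5}
    delay=${2:-0}
    i=1
    while [ "$i" -le "$steps" ]; do
        echo "step $i of $steps"
        sleep "$delay"
        i=$((i + 1))
    done
    # Only reached if the job ran to the end.
    echo "JOB_DONE"
}

# Succeeds (exit status 0) if the last line of the given output file is
# the completion marker; fails if the output looks truncated.
output_complete() {
    [ "$(tail -n 1 "$1")" = "JOB_DONE" ]
}
```

In a job script, something like "run_steps 600 1" as the last line keeps
the job running long enough to trigger a failover; afterwards,
"output_complete <job stdout file>" shows whether the job reached the
end, and the last "step N" line shows where it stopped.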

I expected these rerunable jobs to be requeued when the failover
occurred.  When do jobs get rerun?  Or am I misunderstanding what
rerunable jobs are?

Reference:
http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html

Thank you,

-- 
Victor Gregorio
Penguin Computing

On Wed, Apr 15, 2009 at 03:37:31PM -0700, Victor Gregorio wrote:
> Hello all,
> 
> I have a Torque failover configuration using CentOS' heartbeat services.
> When the failover from the primary to the secondary pbs_server system
> happens, all running jobs marked rerunable try to restart on the
> secondary system.
> 
> The problem is that the diskless compute nodes get re-provisioned,
> destroying all pbs_mom information about the running jobs.  So, when the
> rerunable jobs try to run, they are aborted with "Job does not exist on
> node".
> 
> Is there a way for the pbs_server to resubmit a rerunable job using data
> that is kept on the pbs_server?  I have tried setting pbs_server's -t
> option to 'hot' as well as changing mom_job_sync to False, but no luck.
> Any advice is appreciated.
> 
> -- 
> Victor Gregorio
> Penguin Computing

