[torqueusers] Rerunable jobs for diskless nodes?

Victor Gregorio vgregorio at penguincomputing.com
Wed Apr 15 16:37:31 MDT 2009


Hello all,

I have a Torque failover configuration using the heartbeat service on
CentOS.  When pbs_server fails over from the primary to the secondary
system, all running jobs marked rerunable try to restart under the
secondary pbs_server.

The problem is that the diskless compute nodes get re-provisioned,
destroying all pbs_mom information about the running jobs.  So, when the
rerunable jobs try to run, they are aborted with "Job does not exist on
node".

Is there a way for pbs_server to resubmit a rerunable job using data
that is kept on the pbs_server?  I have tried starting pbs_server with
the -t option set to 'hot', as well as setting the mom_job_sync server
parameter to False, but with no luck.  Any advice is appreciated.
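
For reference, the two settings mentioned above look roughly like this.
This is a sketch rather than a copy of my exact setup: the commands
assume they are run as root on the active pbs_server host, and how
pbs_server actually gets started under heartbeat will differ per site.

```shell
# Start pbs_server in 'hot' mode, so jobs that were running when the
# server went down are picked up again as soon as it comes back.
pbs_server -t hot

# Turn off job synchronization between pbs_server and the pbs_moms.
qmgr -c "set server mom_job_sync = False"

# Confirm the value the server is now using.
qmgr -c "print server" | grep mom_job_sync
```

Neither change keeps the rerunable jobs from being aborted once the
diskless nodes have been re-provisioned.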

-- 
Victor Gregorio
Penguin Computing


