[torqueusers] Rerunable jobs for diskless nodes?
Victor Gregorio
vgregorio at penguincomputing.com
Wed Apr 15 16:37:31 MDT 2009
Hello all,
I have a Torque failover configuration using CentOS' heartbeat services.
When the failover from the primary to the secondary pbs_server system
happens, all running jobs marked rerunable try to restart on the
secondary system.
The problem is that the diskless compute nodes get re-provisioned,
destroying all pbs_mom information about the running jobs. So, when the
rerunable jobs try to run, they are aborted with "Job does not exist on
node".
Is there a way for the pbs_server to resubmit a rerunable job using data
that is kept on the pbs_server? I have tried starting pbs_server with the
-t option set to 'hot', as well as setting the mom_job_sync server
parameter to False, but with no luck. Any advice is appreciated.
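
For reference, a minimal sketch of how those two settings are typically
applied. Assumptions: pbs_server and qmgr are on the PATH of the secondary
head node, and the `run` helper with its DRY_RUN switch is hypothetical,
added only so the commands can be previewed before they touch a live
server. It does not solve the problem described above.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default) each command is only
# printed; with DRY_RUN=0 it is executed. Argument quoting is not
# preserved in the printed preview.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start the server with the 'hot' start type.
run pbs_server -t hot

# With the server running, turn off the mom_job_sync server parameter.
run qmgr -c "set server mom_job_sync = False"

# Print the server configuration to confirm the setting took effect.
run qmgr -c "print server"
```

Running the script unchanged only prints the three commands; setting
DRY_RUN=0 in the environment would execute them.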
--
Victor Gregorio
Penguin Computing