[torquedev] File not found with heavy PBS use

Chris Samuel csamuel at vpac.org
Thu Jul 16 17:56:43 MDT 2009


----- "Luiz Angelo Daros de Luca" <luizluca at gmail.com> wrote:

> There is a monitor process that checks for jobs that have run
> longer than their walltime on unreachable nodes. The nodes are
> diskless and lose job info on reboot (or crash :-) ).

Try this in qmgr instead:

 set server mom_job_sync = True

That way once the node comes back up pbs_server will
ask it about the jobs it thinks are on it and will
get told "never heard of it".  It'll then mark the
job as aborted (an A record in the logs) and should
tidy up neatly for you.
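
If you would rather not open an interactive qmgr session, the same
setting can probably be applied from the shell. This is only a rough
sketch: it assumes qmgr is on your PATH on the pbs_server host and
that you have manager rights there. The grep is just a convenience
to confirm that the attribute shows up in the server configuration
afterwards.

```shell
# Sketch only: assumes this runs on the pbs_server host as a TORQUE manager.
if command -v qmgr >/dev/null 2>&1; then
    # Ask pbs_server to sync its job list with each MOM when it reports in.
    qmgr -c 'set server mom_job_sync = True'

    # Dump the server configuration and check that the setting is listed.
    qmgr -c 'print server' | grep mom_job_sync
else
    echo "qmgr not found; run this on the pbs_server host" >&2
fi
```

Once a rebooted node has checked back in, the aborted jobs should
show up as A records in the server's accounting logs.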

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

