[torqueusers] Re: getting torque/ pbs to reboot a node periodically

Justin Finnerty justin.finnerty at uni-oldenburg.de
Tue Dec 9 13:43:22 MST 2008

I would also suggest attacking this from the other direction.

* You said that you wanted to clean up scratch/temporary files.  We had
the problem of users accidentally leaving data on a node's scratch space.
Eventually we made the scratch directory writable only by root.  pbs_mom
(effectively) creates the TMPDIR as root and then sets the user ownership,
which gives users access only to per-node disk space that is always
cleaned up when the job ends.  (This assumes that your scratch space is
not /tmp!)  As we also clean the scratch directory on a reboot, this has
completely eliminated all our problems with cleaning per-node scratch

* The only memory leaks that can affect a node after a job ends are lost
shared-memory segments.  This topic has been covered before and some
suggestions for clean-up scripts have appeared on this list.

* Why worry about zombies?  Unless you have thousands of them, in which
case I would be jumping on the users to fix their code.  I may be wrong,
but I think they are just dead entries in the process table and the Linux
kernel ignores them for scheduling so they should have zero impact on the
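
On the shared-memory point above, here is a rough sketch of the kind of
clean-up script I mean.  It is not one of the scripts posted to this list
before; the function names and the skip_owners parameter are my own
invention.  It assumes the usual Linux "ipcs -m" column layout (key, shmid,
owner, perms, bytes, nattch, status) and treats any non-root segment with
no attached processes as lost, which may be too aggressive if your users
keep segments alive between jobs.

```python
import subprocess


def orphaned_segments(ipcs_output, skip_owners=("root",)):
    """Return the shmids of segments that no process is attached to.

    ipcs_output is the text printed by `ipcs -m`. Assumed column order:
    key shmid owner perms bytes nattch [status].
    """
    orphans = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and blank lines do not.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        shmid, owner, nattch = fields[1], fields[2], fields[5]
        if owner in skip_owners or not nattch.isdigit():
            continue
        if int(nattch) == 0:
            orphans.append(int(shmid))
    return orphans


def remove_segments(shmids):
    """Remove each segment with `ipcrm -m <shmid>` (needs root or owner)."""
    for shmid in shmids:
        subprocess.run(["ipcrm", "-m", str(shmid)], check=False)


def main():
    """Hypothetical entry point, e.g. called from an epilogue or cron job."""
    result = subprocess.run(["ipcs", "-m"], capture_output=True, text=True)
    remove_segments(orphaned_segments(result.stdout))
```

The parsing is kept separate from the removal so that you can print what
would be deleted on a test node before letting it call ipcrm for real.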

Rebooting the node via a queue has obvious problems.  What do people think
of the following?

* Have a queue administrator create a cron job to submit a job that
requires all the resources of a node (or the node exclusive job property).
All this job does is write a special file into /tmp (e.g.
/tmp/go.for.reboot) and quit.

* Set up your pbs_mom healthcheck script to check for this file and set
the node 'down' when present.  (Shouldn't this stop a new job starting on
the node?)

* Have a cron job on the node that reboots the node when the
/tmp/go.for.reboot file is present.  (Perhaps you should check the file's
ownership to verify that some other user is not messing about.)

* Remove /tmp/go.for.reboot in your boot scripts (e.g. rc.local).
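
To make the steps above concrete, here is a minimal sketch of the flag
check shared by the healthcheck script and the node's cron job.  Everything
except the /tmp/go.for.reboot path is my own assumption: the function
names, the trusted_uid parameter (the uid of the account that submits the
reboot job) and the shutdown command.  I am also assuming that pbs_mom
treats healthcheck output beginning with "ERROR" as a failed check; whether
that actually marks the node down depends on how your site is configured.

```python
import os
import stat
import subprocess

FLAG = "/tmp/go.for.reboot"  # path from the proposal above


def reboot_requested(path, trusted_uid):
    """True if path is a regular file owned by trusted_uid."""
    try:
        st = os.lstat(path)  # lstat: do not follow a user-planted symlink
    except FileNotFoundError:
        return False
    return stat.S_ISREG(st.st_mode) and st.st_uid == trusted_uid


def healthcheck_message(path, trusted_uid):
    """Text for the healthcheck script to print; empty means healthy."""
    if reboot_requested(path, trusted_uid):
        return "ERROR reboot pending, flag file %s present" % path
    return ""


def reboot_if_requested(path, trusted_uid,
                        reboot_cmd=("/sbin/shutdown", "-r", "now")):
    """Cron side: run reboot_cmd when a trusted flag file is present."""
    if not reboot_requested(path, trusted_uid):
        return False
    subprocess.run(list(reboot_cmd), check=True)
    return True
```

The healthcheck script would print healthcheck_message(FLAG, uid) and the
cron job would call reboot_if_requested(FLAG, uid), with uid set to the
queue administrator's account.  Since /tmp is world-writable, the ownership
test is what stops an ordinary user from triggering a reboot.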



Dr Justin Finnerty
Rm W3-1-218         Ph 49 (441) 798 3726
Carl von Ossietzky Universität Oldenburg
