[torqueusers] Re: getting torque/ pbs to reboot a node
justin.finnerty at uni-oldenburg.de
Tue Dec 9 13:43:22 MST 2008
I would also suggest attacking this from the other direction.
* You said that you wanted to clean up scratch/temporary files. We had
the problem of users accidentally leaving data on a node's scratch space.
Eventually we made the scratch directory writable only by root. pbs_mom
(effectively) creates the TMPDIR as root and then sets the user ownership,
which gives users access only to per-node disk space that is always
cleaned up when the job ends. (This assumes that your scratch space is not
/tmp!) As we also clean the scratch directory on a reboot, this has
completely eliminated all our problems with cleaning per-node scratch
space.
* The only memory leaks that can affect a node after a job ends are lost
shared-memory segments. This topic has been covered before and some
suggestions for clean-up scripts have appeared on this list.
* Why worry about zombies? Unless you have thousands of them (in which
case I would be jumping on the users to fix their code), they should not
matter. I may be wrong, but I think they are just dead entries in the
process table, and the Linux kernel ignores them for scheduling, so they
should have zero impact on the node.
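For the lost shared-memory segments mentioned above, here is a rough sketch of the kind of clean-up script I mean. It is only an illustration, not one of the scripts posted to this list: the function names are made up, it assumes the usual Linux `ipcs -m` column layout (key, shmid, owner, perms, bytes, nattch), and it treats "no attached process and not owned by root" as orphaned, which may be too aggressive if something on your nodes keeps persistent segments on purpose.

```python
import subprocess

def parse_ipcs(text):
    """Return (shmid, owner, nattch) for each segment row of `ipcs -m` output."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a hex key (e.g. 0x00000000); header and
        # title lines do not, so they are skipped.
        if len(fields) >= 6 and fields[0].startswith("0x"):
            segments.append((int(fields[1]), fields[2], int(fields[5])))
    return segments

def orphaned_segments(text, skip_owners=("root",)):
    """Ids of segments with no attached process, ignoring skip_owners."""
    return [shmid for shmid, owner, nattch in parse_ipcs(text)
            if nattch == 0 and owner not in skip_owners]

def remove_orphans(dry_run=True):
    """Hypothetical entry point: list (or remove) orphaned segments."""
    out = subprocess.run(["ipcs", "-m"], capture_output=True,
                         text=True, check=True).stdout
    for shmid in orphaned_segments(out):
        if dry_run:
            print("would remove shared-memory segment", shmid)
        else:
            subprocess.run(["ipcrm", "-m", str(shmid)], check=True)
```

Something like this would only be safe to run when no job is on the node, for example from a job epilogue on an exclusively allocated node; keep `dry_run=True` until you trust what it reports.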
Rebooting the node via a queue has obvious problems. What do people think
about the following?
* Have a queue administrator create a cron job to submit a job that
requires all the resources of a node (or the node-exclusive job property).
All this job does is write a special file into /tmp (e.g.
/tmp/go.for.reboot) and then exit.
* Set up your pbs_mom healthcheck script to check for this file and set
the node 'down' when present. (Shouldn't this stop a new job starting on
the node?)
* Have a cron job on the node that reboots the node when the
/tmp/go.for.reboot file is present. (Perhaps you should check the file's
ownership to verify that some other user is not messing about.)
* Remove /tmp/go.for.reboot in your boot scripts (e.g. rc.local).
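To make the healthcheck and node-side cron steps above a bit more concrete, here is a minimal sketch of the flag test they could share. The function names are hypothetical, and `owner_uid` stands for whatever account your queue administrator's job runs as. The sketch also assumes the TORQUE convention, as I understand it, that a healthcheck script reports a problem by printing a line beginning with ERROR; whether that actually marks the node 'down' depends on your server settings, so check the documentation for your version.

```python
import os
import stat
import subprocess

FLAG = "/tmp/go.for.reboot"  # flag path suggested in the post

def flag_is_trusted(path, owner_uid):
    """True if path is a regular file (not a symlink) owned by owner_uid."""
    try:
        st = os.lstat(path)  # lstat: do not follow a symlink planted in /tmp
    except FileNotFoundError:
        return False
    return stat.S_ISREG(st.st_mode) and st.st_uid == owner_uid

def health_check_message(path, owner_uid):
    """Line for the healthcheck script to print; empty when no reboot is due."""
    if flag_is_trusted(path, owner_uid):
        return "ERROR reboot requested via " + path
    return ""

def reboot_if_flagged(path, owner_uid, dry_run=True):
    """For the node-side cron job: reboot only when a trusted flag exists."""
    if not flag_is_trusted(path, owner_uid):
        return False
    if dry_run:
        print("would reboot: flag present at", path)
    else:
        subprocess.run(["/sbin/shutdown", "-r", "now"], check=True)
    return True
```

The healthcheck script would print `health_check_message(FLAG, admin_uid)` and the cron job would call `reboot_if_flagged(FLAG, admin_uid, dry_run=False)`. Using the same ownership test in both places means a flag file dropped by an ordinary user neither takes the node offline nor reboots it.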
Dr Justin Finnerty
Rm W3-1-218 Ph 49 (441) 798 3726
Carl von Ossietzky Universität Oldenburg