[torqueusers] Re: getting torque/ pbs to reboot a node periodically

Justin Finnerty justin.finnerty at uni-oldenburg.de
Tue Dec 9 13:43:22 MST 2008


I would also suggest attacking this from the other direction.

* You said that you wanted to clean up scratch/temporary files.  We had
the problem of users accidentally leaving data on a node's scratch space.
Eventually we made the scratch directory writable only by root.  pbs_mom
(effectively) creates the TMPDIR as root and then sets the user ownership,
so users have access only to per-node disk space that is always cleaned
up when the job ends.  (This assumes that your scratch space is not
/tmp!)  As we also clean the scratch directory on a reboot, this has
completely eliminated all our problems with cleaning per-node scratch
space.

* The only memory leaks that can affect a node after a job ends are lost
shared-memory segments.  This topic has been covered before and some
suggestions for clean-up scripts have appeared on this list.
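
As a rough illustration of what such a clean-up script has to decide, here
is a minimal Python sketch that picks out shared-memory segments with no
attached processes from the text printed by "ipcs -m".  The column layout
(key, shmid, owner, perms, bytes, nattch, status) is an assumption based on
the usual Linux ipcs output, and the function name and sample data are made
up for the example.

```python
def orphaned_segments(ipcs_output):
    """Return (shmid, owner) pairs for segments with no attached processes.

    Assumes the usual Linux `ipcs -m` columns:
    key shmid owner perms bytes nattch [status]
    """
    orphans = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and separator rows do not.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        shmid, owner, nattch = fields[1], fields[2], fields[5]
        if nattch == "0":
            orphans.append((int(shmid), owner))
    return orphans


# Invented sample output, only to show the expected input shape.
SAMPLE = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 65536      alice      600        393216     2          dest
0x0052e2c1 98305      bob        600        1048576    0
"""

print(orphaned_segments(SAMPLE))  # [(98305, 'bob')]
```

A real script would feed it the output of "ipcs -m", skip owners who still
have a job on the node, and remove what remains with "ipcrm -m <shmid>".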

* Why worry about zombies, unless you have thousands of them?  In that
case I would be jumping on the users to fix their code.  I may be wrong,
but I think they are just dead entries in the process table and the Linux
kernel ignores them for scheduling, so they should have zero impact on the
node.
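
If you want a number before deciding whether zombies are worth chasing,
something like the following Python sketch could count them.  It assumes a
Linux /proc layout, where the process state is the first field after the
parenthesised command name in /proc/<pid>/stat and "Z" marks a zombie; the
function names are invented for the example.

```python
import glob


def state_from_stat(stat_line):
    """Extract the one-letter process state from a /proc/<pid>/stat line."""
    # The command name may itself contain spaces or ')', so split on the
    # last ')' rather than on whitespace.
    return stat_line.rpartition(")")[2].split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes whose state is 'Z' (zombie)."""
    count = 0
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as fh:
                if state_from_stat(fh.read()) == "Z":
                    count += 1
        except (OSError, ValueError, IndexError):
            # The process went away between listing and reading, or the
            # line could not be decoded or parsed; skip it.
            continue
    return count


print(count_zombies())
```

A health check could call count_zombies() and only complain above some
threshold of your choosing.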

Rebooting the node via a queue has obvious problems.  What do people feel
about the following?

* Have a queue administrator create a cron job to submit a job that
requires all the resources of a node (or the node exclusive job property).
All this job does is write a special file into /tmp (e.g.
/tmp/go.for.reboot) and exit.

* Set up your pbs_mom healthcheck script to check for this file and set
the node 'down' when present.  (Shouldn't this stop a new job starting on
the node?)

* Have a cron job on the node that reboots the node when the
/tmp/go.for.reboot file is present.  (Perhaps you should check the file's
ownership to verify that some other user is not messing about.)

* Remove the /tmp/go.for.reboot file in your boot scripts (e.g. rc.local).
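
The check shared by the healthcheck script and the node's cron job could be
sketched in Python roughly as below.  This is only an outline under some
assumptions: the flag is a regular file owned by the account that submits
the reboot job (the trusted_uid argument, a placeholder here), and the
healthcheck reports a problem by printing a line beginning with ERROR,
which is how I understand the TORQUE documentation to describe it; verify
that against your version.  The function names are invented.

```python
import os
import stat

FLAG = "/tmp/go.for.reboot"


def reboot_requested(trusted_uid, flag_path=FLAG):
    """True if the flag exists as a regular file owned by trusted_uid."""
    try:
        # lstat so a symlink planted by another user is not followed.
        st = os.lstat(flag_path)
    except FileNotFoundError:
        return False
    return stat.S_ISREG(st.st_mode) and st.st_uid == trusted_uid


def health_check_line(trusted_uid, flag_path=FLAG):
    """Text the healthcheck script would print; empty means healthy."""
    if reboot_requested(trusted_uid, flag_path):
        # Assumed convention: output starting with ERROR flags the node.
        return "ERROR node flagged for reboot"
    return ""
```

The cron job on the node would call reboot_requested() with the same uid
and, when it returns True, run the reboot command (for example
"shutdown -r now").  Checking the owner matters because any user can
create files in /tmp.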

Tschuss

 Justin

-- 
Dr Justin Finnerty
Rm W3-1-218         Ph 49 (441) 798 3726
Carl von Ossietzky Universität Oldenburg



