[torqueusers] getting torque/ pbs to reboot a node periodically.

Garrick Staples garrick at usc.edu
Tue Dec 9 12:17:08 MST 2008


On Tue, Dec 09, 2008 at 11:46:05AM -0600, Rahul Nabar alleged:
> Is there any way to get pbs/torque to reboot a node periodically? Our
> compute nodes keep running forever, and we suspect that over time they
> accumulate zombie processes, memory leaks, etc. Making each node reboot, say,
> on average once every 10 days or so is not a heavy overhead for us. After all,
> a reboot is done in less than 5 minutes. These reboots could also be used by
> me to do some periodic logfile cleanup etc. {We have shared nodes, 8
> cores/node, so I cannot really wipe out my scratch etc. through an epilogue,
> since another job might be running on the other CPUs; and under normal
> circumstances it is not usual to have a completely free node.}
> 
> What's the best way to auto-schedule this? Ideally I do not want the whole
> cluster to reboot. In fact, I don't want to over-specify things at all.
> Maybe the scheduler can choose nodes to reboot based on its scheduling
> strategy, just so long as it reboots each node "on average" once every
> 10 days.
> 
> Any suggestions on implementation?

It is actually difficult to do while avoiding possible race conditions.

First, you need to drain the nodes by marking them offline.  Then you need to
mark them for reboot using the node note.  Then a script can reboot nodes when
it finds them offline, without a job, and marked for reboot.

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/
