[torqueusers] getting torque/ pbs to reboot a node periodically.

Billy Crook billycrook at gmail.com
Tue Dec 9 13:16:39 MST 2008


Why not submit, as a job, "/sbin/reboot"?  Or if permissions would be
an issue, something suid.  You'd request all resources on the node,
and a job time of ten minutes.  The point being to occupy a node
legitimately, and when your time comes as regulated by torque, reboot
the node.  The job itself would probably be reported as failed, since the
reboot kills it, but when the node comes back online it should rejoin the
queue and be available again, right?
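
A rough sketch of what that submission could look like, in Python.  The
node name "node01" and the helper names are hypothetical; ppn=8 comes from
the 8 cores/node mentioned in the original question.  It assumes qsub reads
the job script from stdin when no script file is given, and that the
submitting user may run reboot through passwordless sudo (a suid wrapper
would be the alternative).  Untested against a live cluster.

```python
import subprocess


def build_reboot_qsub(node, cores=8, walltime="00:10:00"):
    """Return a qsub command line that claims every core on one node."""
    resources = f"nodes={node}:ppn={cores},walltime={walltime}"
    return ["qsub", "-l", resources]


# Assumption: the job's user is allowed to run reboot via sudo without a
# password. Swap in a suid wrapper here if that is how your site does it.
JOB_SCRIPT = "#!/bin/sh\nsudo /sbin/reboot\n"


def submit_reboot_job(node, cores=8):
    """Pipe the job script to qsub on stdin and return what qsub prints."""
    cmd = build_reboot_qsub(node, cores)
    result = subprocess.run(
        cmd, input=JOB_SCRIPT, text=True, capture_output=True, check=True
    )
    return result.stdout.strip()
```

Calling submit_reboot_job("node01") once per node, spread out over the ten
days, would leave it to the scheduler to decide when each node is free
enough to run the job.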

P.S.  Credit also to Brock who beat me to it.

On Tue, Dec 9, 2008 at 13:17, Garrick Staples <garrick at usc.edu> wrote:
> On Tue, Dec 09, 2008 at 11:46:05AM -0600, Rahul Nabar alleged:
>> Is there any way to get pbs/torque to reboot a node periodically? Our
>> compute nodes keep running forever, and we suspect that over time they
>> accumulate zombie processes, memory leaks, etc. Rebooting each node, say, on
>> average once every 10 days or so is not a heavy overhead for us. After all, a
>> reboot is done in less than 5 minutes. I could also use these reboots to
>> do some periodic logfile cleanup etc. {We have shared nodes, 8
>> cores/node, so I cannot really wipe out my scratch etc. through an epilogue,
>> since another job might be running on the other cpus; and under normal
>> circumstances it is not usual to have a completely free node.}
>>
>> What's the best way to auto-schedule this? Ideally I do not want the whole
>> cluster to reboot. In fact, I don't want to over-specify things at all.
>> Maybe the scheduler can choose nodes to reboot based on its scheduling
>> strategy. Just so long as it reboots each node "on average" once every
>> 10 days.
>>
>> Any suggestions on implementation?
>
> It is actually difficult to do while avoiding possible race conditions.
>
> First, you need to drain the nodes by marking them offline.  Then you need to
> mark them for reboot using the node note.  Then a script can reboot nodes when
> it finds them offline, without a job, and marked for reboot.
>
> --
> Garrick Staples, GNU/Linux HPCC SysAdmin
> University of Southern California
>
> See the Dishonor Roll at http://www.californiansagainsthate.com/
>
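
The check Garrick describes (offline, without a job, and marked for reboot)
could be sketched along these lines in Python.  This is only an
illustration: it assumes the usual "pbsnodes -a" layout of an unindented
node name followed by indented "key = value" lines, and the note text
"reboot" is a made-up marker.  The draining and marking would be done
separately (for example with pbsnodes -o and pbsnodes -N, if your version
has them), and the sketch does nothing about the race conditions he
mentions.

```python
REBOOT_NOTE = "reboot"  # hypothetical marker stored in the node note


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}."""
    nodes, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends a node record
        elif not line[0].isspace():
            current = nodes.setdefault(line.strip(), {})  # node name line
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")  # split on the first "=" only
            current[key.strip()] = value.strip()
    return nodes


def ready_to_reboot(attrs):
    """True if the node is offline, has no jobs, and is marked for reboot."""
    states = attrs.get("state", "").split(",")
    return (
        "offline" in states
        and not attrs.get("jobs")
        and attrs.get("note") == REBOOT_NOTE
    )


def nodes_to_reboot(text):
    """Return the sorted names of nodes that pass all three checks."""
    parsed = parse_pbsnodes(text)
    return sorted(name for name, attrs in parsed.items() if ready_to_reboot(attrs))
```

A cron script could feed the output of "pbsnodes -a" to nodes_to_reboot(),
reboot whatever comes back, and clear the note and the offline flag once
each node has returned.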

