[torqueusers] getting torque/pbs to reboot a node periodically.

Brock Palen brockp at umich.edu
Tue Dec 9 13:06:45 MST 2008


Could this be done with a moab/maui hack?

From cron, every 10 days, submit one job per node with:

#PBS -l host=$host,naccesspolicy=SINGLEJOB

Those jobs would be submitted by a user who has 'sudo reboot' rights.
You can also use a Moab QoS with QFLAGS=NTR
so that the job is the next to run on the node.

This way the scheduler says:

This job is the next job on node X because it can only run on node X
(host=$host, QFLAGS=NTR).
SINGLEJOB forces that job to be the only job running on that node
when reboot is run by the user with sudoers rights to reboot.

This is 100% hack, and I do not endorse it. Though it might just work.
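
A minimal sketch of the cron-driven submission described above, for
illustration only. The function names, the node-list file and the SUBMIT
variable are assumptions and not part of TORQUE or Moab; the only
scheduler-specific line is the #PBS resource request quoted earlier. It
relies on qsub reading the job script from standard input.

```shell
#!/bin/sh
# Print a one-node reboot job script for the host named in $1.
# "sudo reboot" assumes the submitting user has that right in sudoers.
make_reboot_job() {
    host=$1
    cat <<EOF
#!/bin/sh
#PBS -l host=$host,naccesspolicy=SINGLEJOB
sudo reboot
EOF
}

# Hypothetical driver: submit one job per host listed in the file $1
# (one host name per line, blank lines ignored).
# SUBMIT defaults to qsub; set SUBMIT=cat for a dry run that only
# prints the generated scripts.
submit_reboot_jobs() {
    nodefile=$1
    while read -r host; do
        [ -n "$host" ] || continue
        make_reboot_job "$host" | ${SUBMIT:-qsub}
    done < "$nodefile"
}
```

Calling submit_reboot_jobs with a node-list file from a cron job run by
the sudo-enabled user would queue one reboot job per node; running it
with SUBMIT=cat first shows what would be submitted.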

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Dec 9, 2008, at 2:17 PM, Garrick Staples wrote:

> On Tue, Dec 09, 2008 at 11:46:05AM -0600, Rahul Nabar alleged:
>> Is there any way to get pbs/torque to get a node to reboot periodically?
>> Our compute nodes keep running forever, and we suspect that over time
>> they accumulate zombie processes, memory leaks, etc. Making each node
>> reboot, say, on average once every 10 days or so is not a heavy
>> overhead for us. After all, a reboot is done in less than 5 minutes.
>> These reboots could also be used by me to do some periodic logfile
>> cleanup, etc. {We have shared nodes, 8 cores/node, so I cannot really
>> wipe out my scratch etc. through an epilogue, since another job might
>> be running on the other CPUs; and under normal circumstances it is not
>> usual to have a completely free node.}
>>
>> What's the best way to auto-schedule this? Ideally I do not want the
>> whole cluster to reboot. In fact, I don't want to over-specify things
>> at all. Maybe the scheduler can choose nodes to reboot based on its
>> scheduling strategy, just so long as it reboots each node "on average"
>> once every 10 days.
>>
>> Any suggestions on implementation?
>
> It is actually difficult to do while avoiding possible race conditions.
>
> First, you need to drain the nodes by marking them offline.  Then you
> need to mark them for reboot using the node note.  Then a script can
> reboot nodes when it finds them offline, without a job, and marked for
> reboot.
>
> -- 
> Garrick Staples, GNU/Linux HPCC SysAdmin
> University of Southern California
>
> See the Dishonor Roll at http://www.californiansagainsthate.com/
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
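
The drain, mark and reboot steps in the quoted reply could be sketched
roughly as follows. This is an illustration under stated assumptions,
not a tested procedure: the marker string, the function names and the
use of ssh are hypothetical, and the parser assumes "pbsnodes -a" prints
an unindented node name followed by indented "key = value" lines, which
should be checked against the local output.

```shell
#!/bin/sh
# Assumed marker text stored in the node note; any agreed string would do.
REBOOT_NOTE="reboot-pending"

# Drain a node and tag it: -o marks it offline so no new jobs start,
# -N records the intent in the node note.
mark_for_reboot() {
    pbsnodes -o -N "$REBOOT_NOTE" "$1"
}

# Read "pbsnodes -a"-style text on stdin and print the nodes that are
# offline, list no jobs, and carry the marker note.
reboot_candidates() {
    awk -v note="$REBOOT_NOTE" '
        function emit() {
            if (name != "" && offline && !busy && marked) print name
        }
        /^[^ \t]/ { emit(); name = $1; offline = busy = marked = 0; next }
        $1 == "state" && $3 ~ /offline/ { offline = 1 }
        $1 == "jobs"  && NF > 2         { busy = 1 }
        $1 == "note"  && $3 == note     { marked = 1 }
        END { emit() }
    '
}

# Hypothetical driver for the server host: reboot each candidate.
# ssh -n keeps ssh from swallowing the rest of the node list on stdin.
reboot_marked_nodes() {
    pbsnodes -a | reboot_candidates | while read -r node; do
        ssh -n "$node" sudo reboot
    done
}
```

Something still has to bring a node back into service once it is up
again, for example by clearing the offline flag with pbsnodes -c and
resetting the note; that step is left out of the sketch.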


