[torqueusers] Re: getting torque/ pbs to reboot a node
garrick at usc.edu
Tue Dec 9 13:33:47 MST 2008
On Tue, Dec 09, 2008 at 01:39:37PM -0600, Rahul Nabar alleged:
> >It is actually difficult to do while avoiding possible race conditions.
> >First, you need to drain the nodes by marking them offline. Then you need to
> >mark them for reboot using the node note. Then a script can reboot nodes when
> >it finds them offline, without a job, and marked for reboot.
> Thanks Garrick! How about rebooting at least those nodes that find
> themselves without a job?
Then you run into a race condition. Perhaps the scheduler is about to run a
job on that node?
> Is there a provision so that I can tell pbs to exec a script when a node
> finds itself job-free? (This might work better on my older nodes with only
> 2 cores / node.)
> I could have this (shell) script check when the node was last rebooted
> and, if that was too long ago, reboot it. What do you think of
> this idea?
You can't do it from pbs_mom alone or you will run into race problems.
> Idea 2: I'd have to submit dummy jobs from a cron job on the master node
> that are designed to run on specific nodes. But then again, torque will
> not allow a job to execute a reboot command, will it? Maybe if
> submitted as the root user?
Torque doesn't know or care what commands you run. But rebooting nodes during
your active job is asking for trouble.
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
See the Dishonor Roll at http://www.californiansagainsthate.com/