[torqueusers] Re: getting torque/ pbs to reboot a node periodically

Garrick Staples garrick at usc.edu
Tue Dec 9 13:33:47 MST 2008


On Tue, Dec 09, 2008 at 01:39:37PM -0600, Rahul Nabar alleged:
> >It is actually difficult to do while avoiding possible race conditions.
> 
> >First, you need to drain the nodes by marking them offline.  Then you need to
> >mark them for reboot using the node note.  Then a script can reboot nodes when
> >it finds them offline, without a job, and marked for reboot.
> 
> Thanks Garrick! How about rebooting at least those nodes that find
> themselves without a job?

Then you run into a race condition.  Perhaps the scheduler is about to run a
job on that node?
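
Draining the node first is what closes that window: once a node is marked
offline, the scheduler will not start new work on it. A minimal sketch of
the drain-and-mark step, meant to run on the server host, is below. It
assumes pbsnodes is on the PATH; the note text "reboot" and the function
names are hypothetical conventions chosen for illustration, not anything
Torque defines.

```python
import subprocess

# Hypothetical marker text; any string the reboot script agrees on will do.
REBOOT_NOTE = "reboot"


def drain_commands(node, note=REBOOT_NOTE):
    """Return the pbsnodes invocations that drain a node and mark it.

    'pbsnodes -o' marks the node offline so no new jobs are scheduled on it;
    'pbsnodes -N' sets the node note that flags it for a later reboot.
    """
    return [
        ["pbsnodes", "-o", node],
        ["pbsnodes", "-N", note, node],
    ]


def drain_and_mark(node):
    """Run the commands in order; stop at the first failure."""
    for argv in drain_commands(node):
        subprocess.run(argv, check=True)
```

Running jobs are left alone; the node simply stops receiving new ones, and
the reboot itself is a separate step that happens later.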

 
> Is there a provision so that I can tell pbs to exec a script when it
> finds itself job-free? (This might work better on my older nodes with
> only 2 cores / node.)
> I could have this (shell) script check when the node was last
> rebooted and, if that was too long ago, reboot it. What do you think of
> this idea?

You can't do it safely from pbs_mom alone.  The node has no way of knowing
what the scheduler is about to send it, so you run into the same race.

 
> Idea 2: I could submit dummy jobs from a cron job on the master node,
> each designed to run on a specific node. But then again, torque will
> not allow a job to execute a reboot command, will it? Maybe if
> submitted as the root user?

Torque doesn't know or care what commands you run.  But rebooting a node from
inside a job that is still running on it is asking for trouble.

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/
