[torqueusers] getting torque/ pbs to reboot a node periodically.

Garrick Staples garrick at usc.edu
Tue Dec 9 14:00:27 MST 2008

On Tue, Dec 09, 2008 at 09:42:55PM +0100, Bogdan Costescu alleged:
> >>There is no extra script looking for the node note, the Python 
> >>script polls the state of the node until it's only "offline", 
> >>proceeds to do whatever it needs to reboot the node and as soon as 
> >>the node goes into state "down" it clears the "offline" state.
> >
> >Without marking the node for reboot in some fashion, how do you know 
> >which nodes to reboot?
> The script knows which nodes it needs to reboot; it ignores other 
> nodes which are in "offline" state. If a node is marked "offline" 
> manually but the script is still asked to reboot it, what difference 
> could it make that the "offline" state was acquired from an admin or 
> from the script itself as long as the final result is the same: 
> draining of the node?

Oh, so you are just using a different mechanism to tag the nodes to be
rebooted.

> >And your script doesn't check to see if it has a running job?
> You missed the 'polls the state of the node until it's only "offline"' 
> or maybe I missed making it more verbose and saying 'and doesn't 
> contain other states related to running jobs, like "job-exclusive"'.

You missed that nodes routinely have running jobs while having only the "free",
"busy", or "offline" states.  "job-exclusive" is only a special case where a
job has all of the node's resources.  This might be the norm on your cluster,
but it isn't the norm everywhere.
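
To make the distinction concrete, here is a minimal sketch of a check that
looks at a node's job list as well as its state. It assumes the usual
`pbsnodes -a` text layout (an unindented node name followed by indented
"attribute = value" lines, with a "jobs" attribute present only when jobs are
assigned). The function names, the node names, and the sample text are
hypothetical and only for illustration; this is not the script discussed
above.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a`-style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current node's record.
            current = None
            continue
        if not line[0].isspace():
            # Unindented line: start of a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            # Indented "attribute = value" line belonging to the current node.
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def safe_to_reboot(attrs):
    """True only if the state is exactly "offline" AND no jobs are listed."""
    states = set(attrs.get("state", "").split(","))
    has_jobs = bool(attrs.get("jobs", "").strip())
    return states == {"offline"} and not has_jobs


# Hypothetical sample output; real output carries many more attributes.
SAMPLE = """\
node01
     state = offline
     np = 4
     jobs = 0/101.server, 1/101.server

node02
     state = offline
     np = 4

node03
     state = free
     np = 4
"""

if __name__ == "__main__":
    for name, attrs in parse_pbsnodes(SAMPLE).items():
        print(name, "safe to reboot" if safe_to_reboot(attrs) else "leave alone")
```

In this sample, node01 is "offline" only but still has a job on two of its
four slots, so a check on the state alone would reboot it under a running
job.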

Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/
