[torqueusers] getting torque/pbs to reboot a node periodically.
Garrick Staples
garrick at usc.edu
Tue Dec 9 14:00:27 MST 2008
On Tue, Dec 09, 2008 at 09:42:55PM +0100, Bogdan Costescu alleged:
>
> >>There is no extra script looking for the node note, the Python
> >>script polls the state of the node until it's only "offline",
> >>proceeds to do whatever it needs to reboot the node and as soon as
> >>the node goes into state "down" it clears the "offline" state.
> >
> >Without marking the node for reboot in some fashion, how do you know
> >which nodes to reboot?
>
> The script knows which nodes it needs to reboot; it ignores other
> nodes which are in "offline" state. If a node is marked "offline"
> manually but the script is still asked to reboot it, what difference
> could it make whether the "offline" state was acquired from an admin or
> from the script itself, as long as the final result is the same:
> draining of the node?
Oh, so you are just using a different mechanism to tag the nodes to be
rebooted.
> >And your script doesn't check to see if it has a running job?
>
> You missed the 'polls the state of the node until it's only "offline"'
> or maybe I missed making it more verbose and saying 'and doesn't
> contain other states related to running jobs, like "job-exclusive"'.
You missed that nodes routinely have running jobs while having only the "free",
"busy", or "offline" states. "job-exclusive" is only a special case where a
job has all of the node's resources. This might be the norm on your cluster,
but it isn't the norm everywhere.
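
To make the point concrete, here is a minimal Python sketch of a drain check that looks at the node's job list as well as its state. It assumes the usual plain-text `pbsnodes` layout (node name at column 0, indented `attr = value` lines, with a `jobs` attribute present only while jobs are assigned); the function names `parse_pbsnodes` and `is_drained` and the sample data are hypothetical and are not taken from the script discussed above.

```python
def parse_pbsnodes(text):
    """Parse pbsnodes-style text into {node_name: {attribute: value}}.

    Assumed layout: node name starts at column 0, attributes are
    indented "key = value" lines, and blank lines separate nodes.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None
            continue
        if not line[0].isspace():
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Split on the first "=" only; values may contain more of them.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def is_drained(attrs):
    """True only if the state is exactly "offline" AND no jobs are listed."""
    states = {s.strip() for s in attrs.get("state", "").split(",")}
    has_jobs = bool(attrs.get("jobs", "").strip())
    return states == {"offline"} and not has_jobs


# Hypothetical sample: node01 is "offline" but still has a job assigned.
SAMPLE = """\
node01
     state = offline
     np = 4
     jobs = 0/1234.server

node02
     state = offline
     np = 4

node03
     state = free
     np = 4
"""

if __name__ == "__main__":
    for name, attrs in parse_pbsnodes(SAMPLE).items():
        print(name, "safe to reboot" if is_drained(attrs) else "not drained")
```

A state-only check would treat node01 as ready to reboot; checking `jobs` as well is what catches a node that is "offline" while a job is still running on it.
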
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
See the Dishonor Roll at http://www.californiansagainsthate.com/