[torqueusers] Re: [Mauiusers] node health check

Garrick Staples garrick at clusterresources.com
Wed Nov 15 12:53:52 MST 2006


On Wed, Nov 15, 2006 at 11:46:26AM -0800, Sam Rash alleged:
> In all this I'm a bit lost--how can I rely on a host that's not on (it burnt
> to a crisp) to set itself as 'down'?  the server clearly has to have some
> logic to set hosts as down (even if it is a simple tcp connect/ping).
> Having a node that is 'there', but whose health check script fails set
> itself 'down' is fine as an option, but by default why separate out the
> logic?  Shouldn't a 'resource manager' manage the resources?

The health check logic doesn't replace the current behavior, where a node
is marked "down" because it isn't responding.  It adds a second mechanism
for setting the down bit.

Rest assured, if the node has turned into a pile of ash, then it will be
"down" because it won't be responding anymore.
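
To make the two mechanisms concrete, here is a minimal sketch of how they
could combine.  It is not TORQUE's actual code: the function names are
hypothetical, and the convention of "exit code for yes/no, stderr for the
detail" is the one proposed in this thread, assumed here for illustration.

```python
import subprocess

def health_check_failed(cmd, timeout=30):
    """Run a health check command and report (failed, detail).

    Assumed convention: a nonzero exit code means unhealthy, and
    whatever the script wrote to stderr is the human-readable detail.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A check that cannot run or hangs is treated as a failure.
        return True, f"ERROR health check could not run: {exc}"
    return result.returncode != 0, result.stderr.strip()

def node_is_down(responding, check_failed):
    # Mechanism 1: the node stopped responding.
    # Mechanism 2: the node responds, but its health check failed.
    return (not responding) or check_failed
```

A node that has burnt to a crisp never gets as far as running the check;
it is down via the first mechanism alone.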

 
> I concur using the exit code to indicate yes/no and then stderr with ERROR
> to indicate the verbosity is ideal.
> 
> I also had questions on pbsdsh that weren't clear; Is this what most people
> use to do parallel tasks?  It does not seem to support any mechanism to copy
> a program/binary/input/output for staging...

It is probably not accurate to say "most" people use pbsdsh, but it is
the easy way to launch remote processes through the TM (task management)
interface.

No, it doesn't copy files.  Just use scp/rcp or shared storage for that.
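
As a rough sketch of the scp approach, a job could read the host list from
the file named by $PBS_NODEFILE and build one scp command per unique host.
The file names and remote path below are hypothetical, and the script only
prints the commands rather than running them.

```python
import os
import shlex

def unique_hosts(nodefile_path):
    """Return hosts from a nodefile (one per line), de-duplicated in order."""
    seen = []
    with open(nodefile_path) as f:
        for line in f:
            host = line.strip()
            if host and host not in seen:
                seen.append(host)
    return seen

def scp_commands(hosts, local_path, remote_path):
    """Build one scp argument list per host; nothing is executed here."""
    return [["scp", local_path, f"{host}:{remote_path}"] for host in hosts]

if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        # "input.dat" and "/tmp/input.dat" are placeholder paths.
        for cmd in scp_commands(unique_hosts(nodefile), "input.dat", "/tmp/input.dat"):
            print(shlex.join(cmd))
```

With shared storage none of this is needed, since every node already sees
the same files.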



