[torqueusers] Server does not detect node state change for job
initiated on that node
csamuel at vpac.org
Thu Feb 17 15:02:54 MST 2005
On Tue, 15 Feb 2005 08:36 am, David Singleton wrote:
Hi David! :-)
> If it was just the MOM that had died, you probably wouldn't want the job
> deleted because it's probably running quite happily. So the current
> behaviour is OK in that case.
Agreed. This also matters in the case of something like an Ethernet network
failure on a cluster with some other interconnect, where the parallel jobs
could carry on quite happily whilst the pbs_server is unable to talk to the
mom. There is no point marking the job dead until you can say for certain
it's gone.
A certain piece of commercial software which submits jobs to PBS has a bug of
exactly this kind: it assumes that if a qstat fails (pbs_server down, network
problem, etc.) then all the PBS jobs have failed, which is patently wrong.
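As a rough sketch of the safer behaviour for a submitting tool, a failed
query can be reported as "unknown" rather than as a failed job. The names
`query_job` and `JobStatus` are made up for illustration, and the code
assumes nothing about qstat beyond it being run as `qstat <job id>` and
returning a non-zero exit status when it cannot give an answer.

```python
import subprocess
from enum import Enum


class JobStatus(Enum):
    REPORTED = "reported"   # qstat answered; the caller may parse the output
    UNKNOWN = "unknown"     # no answer; says nothing about the job itself


def query_job(job_id, run=subprocess.run, timeout=30):
    """Ask qstat about one job (hypothetical helper).

    Any failure to get an answer (qstat missing, timeout, non-zero exit)
    is returned as UNKNOWN, never as "the job has failed".
    """
    try:
        result = run(["qstat", job_id], capture_output=True,
                     text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return JobStatus.UNKNOWN, ""
    if result.returncode != 0:
        return JobStatus.UNKNOWN, result.stderr
    return JobStatus.REPORTED, result.stdout
```

A caller would keep retrying on UNKNOWN and only give up on a job once
something has positively confirmed that it is gone.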
> What is needed is some definitive way of saying the node is dead. With
> that info the PBS server could do something useful. But as you saw the
> "deadness" of a node is sometimes ill-defined. ganglia thought your node
> was still alive.
The only thing I can think of would be some way for the mom to know that
it's the first time a mom has been run since a reboot. But even that is hairy
and wouldn't buy you anything more than waiting for the pbs_server to
poll it about the jobs it used to have.
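To make the "first run since a reboot" idea concrete, here is a minimal
sketch, not a description of how pbs_mom actually works. It assumes a Linux
node, where the kernel boot time appears as the `btime` line in /proc/stat,
and a marker file whose location is entirely hypothetical.

```python
def read_boot_time(stat_path="/proc/stat"):
    """Return the kernel boot time (seconds since the epoch) on Linux."""
    with open(stat_path) as f:
        for line in f:
            if line.startswith("btime "):
                return int(line.split()[1])
    raise RuntimeError("no btime line found in " + stat_path)


def first_start_since_boot(marker_path, boot_time):
    """Return True if no start has been recorded for this boot yet.

    The marker file (hypothetical) stores the boot time seen by the
    previous start. A missing, unreadable or different value means the
    node has rebooted since then, so the marker is rewritten.
    """
    try:
        with open(marker_path) as f:
            previous = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        previous = None
    if previous == boot_time:
        return False
    with open(marker_path, "w") as f:
        f.write(str(boot_time))
    return True
```

Usage would be something like
`first_start_since_boot(marker, read_boot_time())` at daemon start-up. It
only tells the mom that the node rebooted; the server still has to be told,
which is why it buys little over waiting for the poll.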
I guess the only sure-fire way would be to have another option to the pbsnodes
command to allow the admin (or a script if they feel adventurous enough) to
mark a node as completely dead, and to have the pbs_server act as if it had
talked to the mom and confirmed that its jobs are gone.
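The intended behaviour can be shown with a toy model. This is purely
illustrative: `ToyServer`, its methods and the state names are invented
here, and no such pbsnodes option or server code is implied to exist.
Losing contact only makes a job's state unknown; an explicit admin
declaration is what marks it gone.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    # job id -> [node name, state]; state is "running", "unknown" or "gone"
    jobs: dict = field(default_factory=dict)

    def start_job(self, job_id, node):
        self.jobs[job_id] = [node, "running"]

    def lose_contact(self, node):
        """The server cannot reach the mom: jobs may well still be running."""
        for entry in self.jobs.values():
            if entry[0] == node and entry[1] == "running":
                entry[1] = "unknown"

    def admin_mark_dead(self, node):
        """The admin asserts the node is dead: treat its jobs as confirmed gone."""
        gone = []
        for job_id, entry in self.jobs.items():
            if entry[0] == node and entry[1] != "gone":
                entry[1] = "gone"
                gone.append(job_id)
        return sorted(gone)
```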
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia