[torqueusers] Server does not detect node state change for job initiated on that node

Chris Samuel csamuel at vpac.org
Thu Feb 17 15:02:54 MST 2005

On Tue, 15 Feb 2005 08:36 am, David Singleton wrote:

Hi David! :-)

> If it was just the MOM that had died, you probably wouldn't want the job
> deleted because it's probably running quite happily.  So the current
> behaviour is OK in that case.

Agreed. This also matters in the case of something like an ethernet network
failure on a cluster with some other interconnect, where the parallel jobs
could carry on quite happily whilst the pbs_server is unable to talk to the
MOM.  No point marking the job dead until you can say for certain it's gone.

A certain piece of commercial software which submits jobs to PBS has exactly
this bug: it assumes that if a qstat fails (pbs_server down, network problem,
etc.) then all the PBS jobs have failed, which is patently wrong.
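
A minimal sketch in Python of how a submitting tool could avoid that mistake.
The names JobView and query_job are made up for illustration, and treating
every non-zero qstat exit as "no information" is my own conservative
assumption, not documented PBS behaviour:

```python
import subprocess
from enum import Enum


class JobView(Enum):
    REPORTED = "server answered about the job"
    UNKNOWN = "could not ask the server; job state is unknown"


def query_job(job_id, run=subprocess.run):
    """Ask qstat about one job. A failed query means UNKNOWN, never 'failed'."""
    try:
        result = run(["qstat", job_id],
                     capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # qstat is missing, or the server/network did not answer in time.
        return JobView.UNKNOWN, ""
    if result.returncode != 0:
        # Assumption: any non-zero exit tells us nothing about the job itself.
        return JobView.UNKNOWN, result.stderr
    return JobView.REPORTED, result.stdout
```

The point is only that a failed query maps to UNKNOWN; the caller should retry
later rather than declare its jobs dead.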

> What is needed is some definitive way of saying the node is dead.  With
> that info the PBS server could do something useful.  But as you saw the
> "deadness" of a node is sometimes ill-defined.  ganglia thought your node
> was still alive.

The only thing I can think of would be some way for the MOM to know that
this is the first time a MOM has been run since a reboot.  But even that is
hairy and wouldn't buy you anything more than waiting for the pbs_server to
poll it about the jobs it used to have.

I guess the only sure-fire way would be to have another option to the pbsnodes
command to allow the admin (or a script if they feel adventurous enough) to
mark a node as completely dead, and to have the pbs_server act as if it had
talked to the MOM and confirmed that the node's jobs are gone.
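
To make the idea concrete, here is a toy in-memory model in Python.  It is
not TORQUE code and does not correspond to any existing pbsnodes option;
ToyServer, lose_contact and mark_dead are hypothetical names.  Losing contact
leaves jobs alone, and only an explicit "dead" declaration writes them off:

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    # node name -> "up", "unreachable" or "dead"
    nodes: dict = field(default_factory=dict)
    # job id -> (node name, state), where state is "running" or "lost"
    jobs: dict = field(default_factory=dict)

    def lose_contact(self, node):
        """Server cannot reach the MOM: jobs are left alone."""
        self.nodes[node] = "unreachable"

    def mark_dead(self, node):
        """Admin declares the node dead: its running jobs are written off."""
        self.nodes[node] = "dead"
        lost = []
        for job_id, (job_node, state) in list(self.jobs.items()):
            if job_node == node and state == "running":
                self.jobs[job_id] = (job_node, "lost")
                lost.append(job_id)
        return lost
```

In this model only mark_dead changes any job state; jobs on other nodes, and
jobs on a merely unreachable node, keep running.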

 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
