[torqueusers] Server does not detect node state change for job initiated on that node

David Osguthorpe David.Osguthorpe at uchsc.edu
Thu Feb 17 20:01:55 MST 2005


On Fri, Feb 18, 2005 at 09:02:54AM +1100, Chris Samuel wrote:
> 
> > If it was just the MOM that had died, you probably wouldn't want the job
> > deleted because it's probably running quite happily. So the current
> > behaviour is OK in that case.
> 
> Agreed, this also matters in the case of something like an ethernet network 
> failure on a cluster with some other interconnect, where the parallel jobs 
> could carry on quite happily whilst the pbs_server is unable to talk to the 
> mom.  No point marking the job dead until you can say for certain it's gone.
> 

The question is which is more likely: do you lose more work by killing jobs
that may still be running under certain MOM/node fault conditions (particularly
faults on the primary mother superior MOM), or do you lose more in the current
situation, where jobs really are not running even though PBS thinks they are
(which means all of the job's nodes are actually idle) and there is no
notification that there is a problem?

So far I've only seen the second case. Note that, as far as I can see, the job
would have remained in the system forever because the walltime was not being
updated on the server, which would have locked those nodes out forever.

At a minimum, in the situation I had, where the server knew it had lost
contact with the primary mother superior node/MOM, the job status in qstat
should change from R to something else, e.g. ? or U (unknown/undetermined)
or I (indeterminate).
Another option would be to e-mail the admin user if contact is lost with a
primary mother superior node/MOM (but not if contact is lost with other slave nodes).
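
To make the suggestion above concrete, here is a rough sketch in Python. It is
not Torque code: the Job class, the on_contact_lost function, the notify
callback and the "U" state letter are all hypothetical names chosen for
illustration. It only shows the intended rule, that losing the mother superior
changes the displayed state and triggers one notification, while losing any
other node does nothing.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    state: str            # single-letter state as qstat would show it, e.g. "R"
    mother_superior: str  # host running the primary MOM for this job
    other_nodes: tuple = ()


def on_contact_lost(job, node, notify):
    """React to the server losing contact with `node` (hypothetical hook).

    Only the loss of the mother superior matters; "U" is an assumed new
    state letter meaning unknown/undetermined.
    """
    if node != job.mother_superior:
        return job.state  # contact lost to another node: leave the job alone
    if job.state == "R":
        job.state = "U"
        notify(f"job {job.job_id}: lost contact with mother superior {node}")
    return job.state
```

Calling it a second time for the same job sends no further mail, because the
state is no longer R.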

Maybe this should be a configurable option for the server: to be able to delete
and remove jobs if the primary mother superior MOM is not contactable. It seems
that under the current torque system all job cleanup etc. is delegated to the
primary mother superior MOM, so there are multiple problems if the server loses
contact with the primary MOM/node, e.g. the endless e-mails to the user as the
PBS server tries to delete the job once it exceeds its walltime but never can,
because the primary MOM is not there:

02/02/2005 05:07:32;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job deleted at request of root at simwulf.uchsc.edu
02/02/2005 05:07:32;0001;PBS_Server;Req;;Server could not connect to MOM
02/02/2005 05:07:32;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job sent signal SIGTERM on delete
02/02/2005 05:08:01;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job deleted at request of root at simwulf.uchsc.edu
02/02/2005 05:08:01;0001;PBS_Server;Req;;Server could not connect to MOM
02/02/2005 05:08:01;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job sent signal SIGTERM on delete
02/02/2005 05:08:32;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job deleted at request of root at simwulf.uchsc.edu
02/02/2005 05:08:32;0001;PBS_Server;Req;;Server could not connect to MOM
02/02/2005 05:08:32;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job sent signal SIGTERM on delete
02/02/2005 05:09:01;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job deleted at request of root at simwulf.uchsc.edu
02/02/2005 05:09:01;0001;PBS_Server;Req;;Server could not connect to MOM
02/02/2005 05:09:01;0008;PBS_Server;Job;499.simwulf.uchsc.edu;Job sent signal SIGTERM on delete
etc.
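
Until the server handles this itself, a loop like the one above can at least be
spotted from the server log. The following is a minimal sketch, assuming the
six ';'-separated fields seen in the excerpt; the field names, the function
names and the threshold are my own. Because the "could not connect to MOM"
record carries no job id, the sketch attributes it to the most recent delete
request, which is a heuristic and not something the log guarantees.

```python
from collections import Counter


def parse_record(line):
    """Split one server log line into its six ';'-separated fields."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # not a log record (e.g. a blank line)
    keys = ("timestamp", "code", "daemon", "kind", "object", "message")
    return dict(zip(keys, parts))


def stuck_deletes(lines, threshold=3):
    """Return {job_id: count} for jobs whose delete failed `threshold`+ times.

    A failure is a "Job deleted at request" record followed by a
    "could not connect to MOM" record before the next delete request.
    """
    failures = Counter()
    pending = None  # job id of the most recent unanswered delete request
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        if rec["kind"] == "Job" and rec["message"].startswith("Job deleted at request"):
            pending = rec["object"]
        elif rec["kind"] == "Req" and "could not connect to MOM" in rec["message"]:
            if pending is not None:
                failures[pending] += 1
                pending = None
    return {job: n for job, n in failures.items() if n >= threshold}
```

Feeding it the open server log file, e.g. stuck_deletes(open(path)) with path
pointing at that day's log, would list the jobs worth looking at by hand.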

- The server should probably also allow the execution of an "epilogue" script,
similar to what the MOM would do.

David O.

