[torqueusers] Server does not detect node state change for job initiated on that node

David Singleton David.Singleton at anu.edu.au
Mon Feb 14 14:36:05 MST 2005


If it was just the MOM that had died, you probably wouldn't want the job
deleted because it's probably running quite happily.  So the current
behaviour is OK in that case.

What is needed is some definitive way of saying the node is dead.  With
that info the PBS server could do something useful.  But, as you saw, the
"deadness" of a node is sometimes ill-defined: Ganglia thought your node
was still alive.
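
As a stopgap, the check_nodes records in the server log can at least be
cross-checked against running jobs.  The following is only a rough sketch
in Python, not anything TORQUE provides: it assumes the semicolon-separated
log layout shown in the quoted excerpt below, and the job-to-primary-node
mapping (primary_node_of) is a hypothetical input that would have to be
collected separately.  It reports suspect jobs; it does not decide whether
a node is really dead.

```python
import re
from datetime import datetime

# Message text taken from the check_nodes record quoted in this thread.
NODE_DOWN = re.compile(r"node (\S+) not detected in (\d+) seconds, marking node down")


def parse_node_down(line):
    """Return (timestamp, node, seconds) for a 'marking node down' record, else None."""
    # Assumed layout: date time;code;daemon;class;name;message
    fields = line.strip().split(";", 5)
    if len(fields) < 6:
        return None
    match = NODE_DOWN.search(fields[5])
    if match is None:
        return None
    when = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
    return when, match.group(1), int(match.group(2))


def jobs_on_down_nodes(log_lines, primary_node_of):
    """Map each job whose primary node was marked down to the first time that was logged.

    primary_node_of is a hypothetical dict of job id -> primary node name.
    """
    first_down = {}
    for line in log_lines:
        event = parse_node_down(line)
        if event is not None:
            when, node, _seconds = event
            first_down.setdefault(node, when)
    return {
        job: first_down[node]
        for job, node in primary_node_of.items()
        if node in first_down
    }
```

Any job this reports still needs a human (or a second, independent check)
to decide whether the node is truly gone or only the MOM has died.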

David

David Osguthorpe wrote:
> Just had an interesting issue with a job.  It was started (i.e. primary MOM)
> on a node which then became hung in some way (you could contact it but it
> refused all forms of login; on the other hand, network daemons on the node were
> still pumping out data, e.g. the Ganglia monitor was still receiving packets
> from it).  Although the PBS server quickly detected that it could not contact the node,
> the job just stayed in the queue with all allocated resources and it could not be
> killed.  qstat just said it was running, although on careful inspection
> you could see it was not accumulating any wall time.  I actually only noticed this
> many hours later when I passed by the machine room and saw a set of nodes
> which had no load, yet supposedly all nodes should have been working.
> 
> (I'm running torque-1.1.0p6-snap.1105412901 on an OSX XServe G5)
> 
> Output in the server_logs directory from the time job 725 started (its primary
> node was nd071):
> 
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root at simwulf.uchsc.edu
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Run at request of root at simwulf.uchsc.edu
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root at simwulf.uchsc.edu
> 02/11/2005 11:17:12;0004;PBS_Server;Svr;check_nodes;node nd071 not detected in 747 seconds, marking node down
> 
> As the server clearly knew this node was down, I don't understand why something didn't happen to the job.
> 
> The only time something happened to the job was when I rebooted the node; after that the job exited.
> 
> The log entries at that time were:
> 
> 02/11/2005 18:07:22;0010;PBS_Server;Job;725.simwulf.uchsc.edu;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=07:07:03
> 02/11/2005 18:07:22;000d;PBS_Server;Job;725.simwulf.uchsc.edu;Post job file processing error; job 725.simwulf.uchsc.edu on host nd071/1+nd071/0+nd072/1+nd072/0+nd073/1+nd073/0+nd074/1+nd074/0+nd075/1+nd075/0+nd077/1+nd077/0+nd078/1+nd078/0+nd079/1+nd079/0+nd080/1+nd080/0+nd081/1+nd081/0+nd082/1+nd082/0+nd083/1+nd083/0+nd084/1+nd084/0+nd085/1+nd085/0+nd086/1+nd086/0+nd087/1+nd087/0+nd088/1+nd088/0+nd089/1+nd089/0+nd090/1+nd090/0+nd091/1+nd091/0+nd092/1+nd092/0
> 
> Other than that there was no entry for job 725 in the PBS server log.  Of course I couldn't see what was in the nd071 MOM log
> until after the reboot.
> 
> The behaviour I would like is for the server to have done something to job 725 when it detected nd071 was down,
> whatever the state of the primary MOM node, and at least to have released the other nodes for other work.
> 
> Thanks
> 
> David


-- 
--------------------------------------------------------------------------
                                     ANU Supercomputer Facility
    David.Singleton at anu.edu.au       and APAC National Facility
    Phone: +61 2 6125 4389           Leonard Huxley Bldg (No. 56)
    Fax:   +61 2 6125 8199           Australian National University
                                     Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------
