I had a similar incident with torque-1.1.0p6 on a Linux Opteron cluster,
where node104 went away on 02/11/2005:

02/11/2005 23:11:17;0004;PBS_Server;Svr;check_nodes;node node104 not detected in 320 seconds, marking node down.

Two days later, the job ran out of time and maui tried to delete it, but the
deletion did not work.  So on every scheduling cycle maui would try to delete
the job, the deletion would fail, and an email would be sent to the user.
Running pbsnodes -l showed the node as down, so I was surprised that maui
did not know that the node was down.
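
In case it helps anyone chasing the same thing, here is a rough Python sketch
for pulling the "marking node down" records out of a server log so they can be
compared against what maui believes. It assumes each record sits on one line
and has the six semicolon-separated fields seen in the entries quoted in this
thread. The function name find_down_nodes and the field names are my own and
are not part of torque.

```python
import re
from datetime import datetime

# Assumed record layout, inferred only from the entries quoted in this thread:
#   timestamp;event code;server name;object type;object name;message
NODE_DOWN = re.compile(
    r"node (\S+) not detected in (\d+) seconds, marking node down"
)


def find_down_nodes(lines):
    """Return (timestamp, node, seconds) for each 'marking node down' record."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # not a complete record, skip it
        stamp, _code, _server, _otype, obj_name, message = fields
        if obj_name != "check_nodes":
            continue  # only the node check reports nodes going down
        match = NODE_DOWN.search(message)
        if match:
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
            hits.append((when, match.group(1), int(match.group(2))))
    return hits


# The records below are copied from this thread, joined onto single lines.
SAMPLE = [
    "02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;"
    "Job Run at request of root at simwulf.uchsc.edu",
    "02/11/2005 11:17:12;0004;PBS_Server;Svr;check_nodes;"
    "node nd071 not detected in 747 seconds, marking node down",
    "02/11/2005 23:11:17;0004;PBS_Server;Svr;check_nodes;"
    "node node104 not detected in 320 seconds, marking node down.",
]

for when, node, seconds in find_down_nodes(SAMPLE):
    print(when, node, seconds)
```

To run it against a real log, pass an open file object instead of SAMPLE. The
list of nodes it returns can then be checked by hand against the nodes the
scheduler still treats as busy.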


On 2/14/05 9:30 AM, "David Osguthorpe" <David.Osguthorpe at uchsc.edu> wrote:

> Just had an interesting issue with a job. It was started (i.e. primary mom)
> on a node which then became hung in some way (you could contact it but it
> refused all forms of login; on the other hand, network daemons on the node
> were still pumping out data, e.g. the ganglia monitor was still receiving
> packets from it). Although the PBS server quickly detected that it could not
> contact the node, the job just stayed in the queue with all allocated
> resources and it could not be killed. qstat just said it was running,
> although after careful inspection you could see it was not incrementing any
> wall time. I actually only noticed this many hours later when I passed by the
> machine room and saw a set of nodes which had no load when supposedly all
> nodes should have been working.
> (I'm running torque-1.1.0p6-snap.1105412901 on an OSX XServe G5.)
>
> Output in the server_logs directory at the time job 725 started, whose
> primary slave node was nd071:
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root at simwulf.uchsc.edu
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Run at request of root at simwulf.uchsc.edu
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root at simwulf.uchsc.edu
> 02/11/2005 11:17:12;0004;PBS_Server;Svr;check_nodes;node nd071 not detected in 747 seconds, marking node down
> As the server clearly knew this node was down, I don't understand why
> nothing happened to the job. The only time something happened to the job was
> when I rebooted the node; after that the job exited.
> The log entries at that time were:
> 02/11/2005 18:07:22;0010;PBS_Server;Job;725.simwulf.uchsc.edu;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=07:07:03
> 02/11/2005 18:07:22;000d;PBS_Server;Job;725.simwulf.uchsc.edu;Post job file processing error; job 725.simwulf.uchsc.edu on host nd071/1+nd071/0+nd072/1+nd072/0+nd073/1+nd073/0+nd074/1+nd074/0+nd075/1+nd075/0+nd077/1+nd077/0+nd078/1+nd078/0+nd079/1+nd079/0+nd080/1+nd080/0+nd081/1+nd081/0+nd082/1+nd082/0+nd083/1+nd083/0+nd084/1+nd084/0+nd085/1+nd085/0+nd086/1+nd086/0+nd087/1+nd087/0+nd088/1+nd088/0+nd089/1+nd089/0+nd090/1+nd090/0+nd091/1+nd091/0+nd092/1+nd092/0
> Other than that there was no entry for job 725 in the PBS server log. Of
> course I couldn't see what was in the nd071 mom log until after the reboot.
>
> The behaviour I would like is for the server to have done something to job
> 725 when it detected nd071 was down, whatever the state of the primary mom
> node, and at least released the other nodes for other work.
> Thanks
> David
