[torqueusers] Server does not detect node state change for job initiated on that node

Garrick Staples garrick at usc.edu
Mon Feb 14 15:22:30 MST 2005

If my own changelog entries are correct, that bug is present in 1.1.0p6 final.
It is triggered by a long prologue (or some sort of system hang that causes a
prologue to take too long) and is fixed in an early 1.2.0b0 snapshot.

I think you want this for 1.1.0p6:

On Mon, Feb 14, 2005 at 09:30:31AM -0700, David Osguthorpe alleged:
> Just had an interesting issue with a job - it was started (i.e. primary mom)
> on a node which then became hung in some way (you could contact it but it
> refused all forms of login - on the other hand network daemons on the node were
> still pumping out data - e.g. the ganglia monitor was still receiving packets
> from it). Although the PBS server quickly detected that it could not contact the node,
> the job just stayed in the queue with all allocated resources and it could not be
> killed. qstat just said it was running, although after careful inspection
> you could see it was not incrementing any wall time. I actually only noticed this
> many hours later when I passed by the machine room and saw a set of nodes
> which had no load when supposedly all nodes should have been working.
> (I'm running torque-1.1.0p6-snap.1105412901 on an OSX XServe G5.)
> Output in the server_logs directory at the time of start of job 725, whose primary
> slave node was nd071:
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root@simwulf.uchsc.edu
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Run at request of root@simwulf.uchsc.edu
> 02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root@simwulf.uchsc.edu
> 02/11/2005 11:17:12;0004;PBS_Server;Svr;check_nodes;node nd071 not detected in 747 seconds, marking node down
> As the server clearly knew this node was down, I don't understand why nothing happened to the job.
> The only time something happened to the job was when I rebooted the node - after that the job exited.
> The log entries at that time were:
> 02/11/2005 18:07:22;0010;PBS_Server;Job;725.simwulf.uchsc.edu;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=07:07:03
> 02/11/2005 18:07:22;000d;PBS_Server;Job;725.simwulf.uchsc.edu;Post job file processing error; job 725.simwulf.uchsc.edu on host nd071/1+nd071/0+nd072/1+nd072/0+nd073/1+nd073/0+nd074/1+nd074/0+nd075/1+nd075/0+nd077/1+nd077/0+nd078/1+nd078/0+nd079/1+nd079/0+nd080/1+nd080/0+nd081/1+nd081/0+nd082/1+nd082/0+nd083/1+nd083/0+nd084/1+nd084/0+nd085/1+nd085/0+nd086/1+nd086/0+nd087/1+nd087/0+nd088/1+nd088/0+nd089/1+nd089/0+nd090/1+nd090/0+nd091/1+nd091/0+nd092/1+nd092/0
> Other than that there was no entry for job 725 in the PBS server log - of course I couldn't see what was in the nd071 mom log
> until after the reboot.
> The behaviour I would like is that the server should have done something to job 725 when it detected nd071 was down,
> whatever the state of the primary mom node - and at least released the other nodes for other work.
> Thanks
> David
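
Until the fix is in place, one way to catch this kind of stuck job sooner is to
watch the server log for the "marking node down" message quoted above. The
following is only a rough sketch in Python, not anything shipped with TORQUE.
It assumes the six semicolon-separated fields seen in the quoted log lines; the
field names (timestamp, event_code, daemon, obj_type, obj_name, message) and
the helper names parse_line and node_down_events are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogRecord(NamedTuple):
    # Field names are assumptions based on the quoted log lines.
    timestamp: str
    event_code: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


# Matches the check_nodes message shown in the quoted server log.
NODE_DOWN = re.compile(
    r"node (\S+) not detected in (\d+) seconds, marking node down"
)


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one server log line into its six fields, or return None."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def node_down_events(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (timestamp, node, seconds) for each 'marking node down' entry."""
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        match = NODE_DOWN.search(rec.message)
        if match:
            yield rec.timestamp, match.group(1), int(match.group(2))
```

Fed the lines of a server log file, node_down_events yields one tuple per node
the server gave up on, which can then be compared by hand against the exec
hosts of running jobs. Lines copied out of an email need the leading "> "
stripped first.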
Garrick Staples, Linux/HPCC Administrator
University of Southern California
