[torqueusers] Server does not detect node state change for job initiated on that node

David Osguthorpe David.Osguthorpe at uchsc.edu
Mon Feb 14 09:30:31 MST 2005


I just had an interesting issue with a job. It was started on a node (i.e. that
node was the primary mom) which then hung in some way: you could contact it, but
it refused all forms of login. On the other hand, network daemons on the node
were still pumping out data (e.g. the ganglia monitor was still receiving
packets from it). Although the PBS server quickly detected that it could not
contact the node, the job just stayed in the queue with all its allocated
resources, and it could not be killed. qstat simply reported it as running,
although on careful inspection you could see that it was not accumulating any
wall time. I only noticed this many hours later, when I passed by the machine
room and saw a set of nodes with no load at a time when all nodes should
supposedly have been working.

(I'm running torque-1.1.0p6-snap.1105412901 on an OS X Xserve G5.)

Output in the server_logs directory from around the start of job 725, whose
primary node was nd071:

02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root at simwulf.uchsc.edu
02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Run at request of root at simwulf.uchsc.edu
02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;Job Modified at request of root at simwulf.uchsc.edu
02/11/2005 11:17:12;0004;PBS_Server;Svr;check_nodes;node nd071 not detected in 747 seconds, marking node down

As the server clearly knew this node was down, I don't understand why nothing happened to the job.
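
For anyone wanting to spot this situation from the logs rather than from the machine room, here is a minimal Python sketch that picks out the "marking node down" records. It assumes every server log record has the six semicolon-separated fields seen in the excerpt above; the field names (timestamp, event_class, and so on) and the function names are my own labels, not anything defined by TORQUE.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogRecord(NamedTuple):
    # Field names are illustrative guesses based on the excerpt above.
    timestamp: str
    event_class: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    # Split at most five times so a ';' inside the message is preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


# Pattern copied from the wording of the check_nodes record in the excerpt.
NODE_DOWN = re.compile(
    r"node (\S+) not detected in (\d+) seconds, marking node down"
)


def down_nodes(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (timestamp, node, seconds) for each 'marking node down' record."""
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        match = NODE_DOWN.search(rec.message)
        if match:
            yield rec.timestamp, match.group(1), int(match.group(2))


sample = [
    "02/11/2005 11:00:14;0008;PBS_Server;Job;725.simwulf.uchsc.edu;"
    "Job Run at request of root at simwulf.uchsc.edu",
    "02/11/2005 11:17:12;0004;PBS_Server;Svr;check_nodes;"
    "node nd071 not detected in 747 seconds, marking node down",
]

print(list(down_nodes(sample)))
# [('02/11/2005 11:17:12', 'nd071', 747)]
```

This only reports what the server already logged; it does not change how the server treats the job.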

The only time something happened to the job was when I rebooted the node; after that the job exited.

The log entries at that time were:

02/11/2005 18:07:22;0010;PBS_Server;Job;725.simwulf.uchsc.edu;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=07:07:03
02/11/2005 18:07:22;000d;PBS_Server;Job;725.simwulf.uchsc.edu;Post job file processing error; job 725.simwulf.uchsc.edu on host nd071/1+nd071/0+nd072/1+nd072/0+nd073/1+nd073/0+nd074/1+nd074/0+nd075/1+nd075/0+nd077/1+nd077/0+nd078/1+nd078/0+nd079/1+nd079/0+nd080/1+nd080/0+nd081/1+nd081/0+nd082/1+nd082/0+nd083/1+nd083/0+nd084/1+nd084/0+nd085/1+nd085/0+nd086/1+nd086/0+nd087/1+nd087/0+nd088/1+nd088/0+nd089/1+nd089/0+nd090/1+nd090/0+nd091/1+nd091/0+nd092/1+nd092/0
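
The host list in that last record shows how many nodes the job was holding while it sat idle. As a rough sketch, the string can be reduced to distinct node names like this. It assumes entries are joined with "+" and that the "/N" suffix is a per-node slot index, which is how the record above appears to be laid out; the function name is my own.

```python
from typing import List


def nodes_in_exec_host(exec_host: str) -> List[str]:
    """Return the distinct node names in a host list, in first-seen order."""
    seen: List[str] = []
    for entry in exec_host.split("+"):
        # Drop the assumed "/slot" suffix, keeping only the node name.
        node = entry.split("/", 1)[0]
        if node and node not in seen:
            seen.append(node)
    return seen


# Shortened form of the host list from the log record above.
print(nodes_in_exec_host("nd071/1+nd071/0+nd072/1+nd072/0"))
# ['nd071', 'nd072']
```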

Other than that, there were no entries for job 725 in the PBS server log. Of course, I couldn't see what was
in the nd071 mom log until after the reboot.

The behaviour I would like is for the server to have done something to job 725 as soon as it detected that
nd071 was down, whatever the state of the primary mom node, and at least to have released the other nodes
for other work.
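
To make the request concrete, here is a small Python sketch of the check I have in mind. It is purely illustrative and is not how pbs_server works: the function name, the dictionary of running jobs, and the shortened node lists are all hypothetical stand-ins for whatever the server tracks internally.

```python
from typing import Dict, List, Set


def jobs_on_down_nodes(
    running: Dict[str, List[str]], down: Set[str]
) -> Dict[str, List[str]]:
    """Map each job that holds a down node to the healthy nodes it still holds.

    The returned node lists are the ones that could be released for other
    work once the affected job has been dealt with.
    """
    affected: Dict[str, List[str]] = {}
    for job_id, nodes in running.items():
        if any(node in down for node in nodes):
            affected[job_id] = [node for node in nodes if node not in down]
    return affected


# Hypothetical snapshot: job 725 with a shortened node list, plus one
# made-up job on an unrelated node.
running = {
    "725.simwulf.uchsc.edu": ["nd071", "nd072", "nd073"],
    "hypothetical-job": ["nd010"],
}

print(jobs_on_down_nodes(running, {"nd071"}))
# {'725.simwulf.uchsc.edu': ['nd072', 'nd073']}
```

What the server should then do with an affected job (requeue it, kill it, or just flag it) is a policy question I am leaving open here.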

Thanks

David

