[torqueusers] No contact with server at hostaddr problem

Curtis Wensley curtis.wensley at us.cd-adapco.com
Wed Jan 10 07:42:39 MST 2007


I've been trying to figure this out, but nothing I do fixes the problem.
I have a 20-node cluster, and whenever a job gets queued to node16 the
node's status goes from "free" to "down".  My head node does not show
any problems in its logs, but node16 does.  The following is what I'm
getting from the mom_logs on node16:


01/04/2007 10:22:47;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at host.cluster.com, sock=11
01/04/2007 10:22:47;0002;   pbs_mom;n/a;mom_main;connection to server host timeout
01/04/2007 10:22:47;0002;   pbs_mom;n/a;mom_main;hello sent to server host
01/04/2007 10:23:03;0080;   pbs_mom;Req;jobobit;No contact with server at hostaddr ac2800fa, port 15001, jobid 330.host.cluster.com errno 111
....
The last line keeps repeating until I delete the job from the queue.

I don't understand what the hostaddr ac2800fa is referring to.  I can
ping host.cluster.com from node16, so basic network connectivity seems
fine.  Any help would be appreciated.
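
A note on the hostaddr value: it looks like an IPv4 address printed as
eight hex digits.  Assuming that is the case, and that the bytes are in
network order, the short Python sketch below turns it into dotted-quad
form so it can be compared with the address host.cluster.com is expected
to have.  The function name decode_hostaddr is made up for this example
and is not part of TORQUE.  Separately, on Linux errno 111 is
ECONNREFUSED, which would suggest the connection to port 15001 (usually
the pbs_server port) was actively refused rather than lost.

```python
import socket


def decode_hostaddr(hex_addr):
    """Convert an 8-hex-digit IPv4 address (e.g. 'ac2800fa') to dotted-quad.

    Assumption: the digits are the four address bytes in network order.
    """
    raw = bytes.fromhex(hex_addr)
    if len(raw) != 4:
        raise ValueError("expected exactly 8 hex digits")
    return socket.inet_ntoa(raw)


if __name__ == "__main__":
    # The value from the mom_logs line above.
    print(decode_hostaddr("ac2800fa"))  # prints 172.40.0.250
```

If the decoded address is not the one the head node actually uses, the
name resolution on node16 (for example /etc/hosts) would be the first
thing to check.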

-- 
Curtis Wensley



