[torqueusers] pbs_mom No Route To Host

Adam Fedor fedor at qwestoffice.net
Sat Nov 22 12:06:58 MST 2008


I have installed torque 2.1.10 on a set of Fedora systems (via rpm),
and I'm having trouble with the client nodes communicating back to the
head node.  I can submit and run jobs on the head node fine.  But if I
submit a job that goes to another node, it runs on that node, but
neither the job-completion notice nor the job output gets sent back,
so the head node thinks the job is still running.  The head node can
'see' the other nodes, e.g. with pbsnodes -a:

clthps1 ~>pbsnodes -a
[.... LINES DELETED FOR CLARITY ....]
clthps4.clt.internal
     state = free
     np = 1
     ntype = cluster
     status = opsys=linux,uname=Linux clthps4.clt.internal 2.6.25-14.fc9.x86_64 #1 SMP Thu May 1 06:06:21 EDT 2008 x86_64,sessions=? 15201,nsessions=? 15201,nusers=0,idletime=92405,totmem=24661640kb,availmem=24233792kb,physmem=16468500kb,ncpus=16,loadave=0.00,netload=2303793036,state=free,jobs=66.clthps1,rectime=1227282361

But the client nodes can't seem to talk to the head node:

clthps4 ~>pbsnodes -a
No route to host
pbsnodes: cannot connect to server clthps1.clt.internal, error=113

and in the mom logs I get many lines of errors like this:

11/21/2008 10:50:26;0080;   pbs_mom;Req;jobobit;No contact with server  
at hostaddr ac143220, port 15001, jobid 66.clthps1 errno 113
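
For reference, the two numbers in that log line can be decoded with a short script. This is only a sketch: it assumes that hostaddr is an IPv4 address printed as hex in network byte order, and that errno follows the Linux numbering (where 113 is EHOSTUNREACH, the same "No route to host" that pbsnodes reports). The helper names are made up for illustration and are not part of torque.

```python
import errno
import os
import socket


def decode_hostaddr(hex_addr: str) -> str:
    """Convert an 8-digit hex IPv4 address to dotted-quad notation.

    Assumption: the hex digits are in network (big-endian) byte order.
    """
    return socket.inet_ntoa(bytes.fromhex(hex_addr))


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # Values copied from the pbs_mom log line above.
    print(decode_hostaddr("ac143220"))
    print(describe_errno(113))
```

Under those assumptions the address decodes to 172.20.50.32, which can be compared against what the host command returns for clthps1.clt.internal.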

I can ssh back and forth to all the machines (without a password but
with a passphrase), and forward and reverse DNS seem to work (using the
host command, although I don't know if that's authoritative or not).
All the ports that the daemons are running on seem right (I did not
change any port configuration).  I think my IT guys set up a bonded
interface to the machines, so perhaps that could cause a problem?  Are
there any other things I can check?  Thanks.
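
One check that separates "ssh works" from "the pbs port works" is to attempt a plain TCP connection from a client node to the head node on each port. The sketch below assumes the server port is 15001, as shown in the mom log, and uses a hypothetical helper name, check_port. It only reports whether the connection succeeds and which errno comes back; it does not say why. A packet filter on the head node that rejects the port is one possible cause that would look like this, but that is a guess, not something the logs confirm.

```python
import socket


def check_port(host, port, timeout=3.0):
    """Try a TCP connection to host:port.

    Returns (True, None) on success, or (False, errno) on failure.
    errno may be None for a timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, exc.errno


def report(host, ports):
    """Print one line per port with the connection result."""
    for port in ports:
        ok, err = check_port(host, port)
        status = "open" if ok else f"failed (errno {err})"
        print(f"{host}:{port} {status}")
```

Calling report("clthps1.clt.internal", (22, 15001)) on clthps4 would show whether port 22 connects while port 15001 fails with errno 113.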

