[torqueusers] pbs_mom No Route To Host
Adam Fedor
fedor at qwestoffice.net
Sat Nov 22 12:06:58 MST 2008
I have installed torque 2.1.10 on a set of Fedora systems (via rpm),
and I'm having trouble with the client nodes communicating back to the
head node. I can submit and run jobs on the head node fine. But if I
submit a job that goes to another node, it will run on that other
node, but nothing about the job finishing or job output will get sent
back, so the head node thinks the job is still running. Also, the
head node can 'see' the other nodes, e.g. with pbsnodes -a:
clthps1 ~>pbsnodes -a
[.... LINES DELETED FOR CLARITY ....]
clthps4.clt.internal
     state = free
     np = 1
     ntype = cluster
     status = opsys=linux,uname=Linux clthps4.clt.internal 2.6.25-14.fc9.x86_64 #1 SMP Thu May 1 06:06:21 EDT 2008 x86_64,sessions=? 15201,nsessions=? 15201,nusers=0,idletime=92405,totmem=24661640kb,availmem=24233792kb,physmem=16468500kb,ncpus=16,loadave=0.00,netload=2303793036,state=free,jobs=66.clthps1,rectime=1227282361
But the client nodes can't seem to talk to the head node:
clthps4 ~>pbsnodes -a
No route to host
pbsnodes: cannot connect to server clthps1.clt.internal, error=113
and in the mom logs I get many lines of errors like this:
11/21/2008 10:50:26;0080; pbs_mom;Req;jobobit;No contact with server at hostaddr ac143220, port 15001, jobid 66.clthps1 errno 113
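The hostaddr and errno fields in that log line can be made readable with a short script. This is only a sketch: it assumes the hostaddr value is an IPv4 address printed as hexadecimal with the most significant byte first, and the helper name decode_hostaddr is made up for illustration. The errno lookup uses the numbering of whatever system runs the script, so it should be run on one of the Linux nodes.

```python
import errno
import os
import socket


def decode_hostaddr(hex_addr: str) -> str:
    """Convert a hex IPv4 address (as seen in the pbs_mom log) to dotted-quad form.

    Assumption: the value is in network byte order (most significant byte first).
    """
    # Pad to 8 hex digits in case leading zeros were dropped when the log was written.
    raw = bytes.fromhex(hex_addr.zfill(8))
    return socket.inet_ntoa(raw)


if __name__ == "__main__":
    # Address and errno copied from the log line above.
    print(decode_hostaddr("ac143220"))  # 172.20.50.32
    # Symbolic name and message for errno 113 on the local system.
    print(errno.errorcode.get(113, "unknown"), os.strerror(113))
```

If the decoded address is not the one the head node is expected to have on the cluster network, that would be worth comparing against the DNS results.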
I can ssh back and forth between all the machines (without a password but
with a passphrase), and forward and reverse DNS seem to work (using the
host command, although I don't know whether that's authoritative).
All the ports that the daemons are running on seem right (I did not
change any port configuration). I think my IT guys set up a bonded
interface on the machines, so perhaps that could cause a problem? Are
there any other things I can check? Thanks.
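One check that could be run from a client node is a plain TCP connection attempt to the head node on the ports involved. The following is a minimal sketch, not a TORQUE tool: the function name probe is hypothetical, port 15001 is taken from the mom log above, and port 22 is assumed to be the ssh port (its usual default) for comparison.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and report 'open' or the error that occurred."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        # Covers refused connections, timeouts, unreachable hosts and DNS failures.
        return f"failed: {exc}"


if __name__ == "__main__":
    # Head node name from the pbsnodes error; ports as described in the lead-in.
    for port in (22, 15001):
        print(port, probe("clthps1.clt.internal", port))
```

If port 22 reports open while port 15001 fails with the same "No route to host" error, the problem would appear to be specific to that port rather than to general routing or name resolution.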