[torqueusers] mom log error "cannot connect to port 1023 in client_to_svr"
Steve Young
chemadm at hamilton.edu
Mon Feb 22 12:26:37 MST 2010
Hi all,
I'm running torque version 2.4.3. I seem to be getting a strange
error message in the mom client logs on one of my new Altix nodes:
02/22/2010 13:27:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
now in progress (115) in scan_for_exiting, cannot connect to port 1023
in client_to_svr - connection refused
02/22/2010 13:27:34;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
now in progress (115) in scan_for_exiting, cannot connect to port 1023
in client_to_svr - connection refused
02/22/2010 13:27:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
now in progress (115) in scan_for_exiting, cannot connect to port 1023
in client_to_svr - connection refused
02/22/2010 13:27:36;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
now in progress (115) in scan_for_exiting, cannot connect to port 1023
in client_to_svr - connection refused
This continually repeats itself.
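For reference, the number in parentheses looks like a Linux errno. On Linux, 115 is EINPROGRESS, which a non-blocking connect() reports while the connection is still being set up. Below is a minimal sketch for pulling that number and the reporting function out of a mom log line. It assumes the semicolon-separated layout seen above; the helper names parse_mom_error and errno_name are my own and are not part of torque.

```python
import errno
import re

# Sample line copied from the mom log above, re-joined onto one line.
SAMPLE = (
    "02/22/2010 13:27:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation "
    "now in progress (115) in scan_for_exiting, cannot connect to port 1023 "
    "in client_to_svr - connection refused"
)

# Matches "(<errno>) in <function>", e.g. "(115) in scan_for_exiting".
ERRNO_RE = re.compile(r"\((\d+)\) in (\w+)")


def parse_mom_error(line):
    """Return (timestamp, errno_number, function_name), or None if no match."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    # Assumption: the timestamp is everything before the first semicolon.
    timestamp = line.split(";", 1)[0]
    return timestamp, int(match.group(1)), match.group(2)


def errno_name(number):
    """Map an errno number to its symbolic name on the current platform."""
    # errno values are platform specific, so run this on the mom host itself.
    return errno.errorcode.get(number, "unknown")
```

Calling parse_mom_error(SAMPLE) and passing the second field to errno_name on the Linux client should show which errno the mom is logging.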
Let me explain that this new client is in a remote data center and
we've been having to dork around with firewall rules and such to make
this work. I originally thought it was a firewall issue here too but
it seems as if the connections are indeed going through when I sniff
the interfaces on each host:
On the qserver I see:
[root at qserver server_logs]# tcpdump -i eth0 port 1023
tcpdump: verbose output suppressed, use -v or -vv for full protocol
decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
12:50:02.812206 IP clienthost.1023 > qserver.15001: UDP, length 411
12:50:02.812423 IP qserver.15001 > clienthost.1023: UDP, length 26
and on the clienthost I also see:
clienthost:/var/spool/torque/mom_logs # tcpdump -i eth0 port 1023
tcpdump: verbose output suppressed, use -v or -vv for full protocol
decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
12:50:07.356773 IP clienthost.1023 > qserver.15001: UDP, length 411
12:50:07.411752 IP qserver.15001 > clienthost.1023: UDP, length 26
12:50:52.974771 IP clienthost.1023 > qserver.15001: UDP, length 411
12:50:53.030257 IP qserver.15001 > clienthost.1023: UDP, length 26
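Note that the captures above only show UDP packets, while a "connection refused" usually comes from a TCP connect. A rough sketch for testing a TCP connection from the client by hand follows. It assumes the server accepts TCP on port 15001 (the port seen in the captures); probe_tcp is a hypothetical helper, not a torque tool. Binding a source port below 1024 requires root.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0, source_port=None):
    """Attempt one TCP connect and return a short status string."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        if source_port is not None:
            # Optionally connect from a fixed (e.g. privileged) source port,
            # so that firewall rules keyed on the source port are exercised.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind(("", source_port))
        sock.connect((host, port))
        return "connected"
    except socket.timeout:
        # No answer at all: packets are probably being dropped somewhere.
        return "timeout"
    except OSError as exc:
        # e.g. ECONNREFUSED when something actively rejects the connection.
        return errno.errorcode.get(exc.errno, str(exc))
    finally:
        sock.close()
```

Running probe_tcp("qserver", 15001) on clienthost, and again as root with source_port=1023, would show whether the TCP path is refused, silently dropped, or open. A "timeout" result would point at a firewall dropping packets, while "ECONNREFUSED" would mean something is actively rejecting the connection.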
The server also appears able to reach the client with the
pbsnodes command:
[root at qserver torque]# pbsnodes -a clienthost
clienthost
state = free
np = 128
properties = altix
ntype = cluster
jobs = 0/774702.qserver
status = opsys=linux,uname=Linux clienthost 2.6.16.60-0.42.5-default #1 SMP Mon Aug 24 09:41:41 UTC 2009 ia64,sessions=? 0,nsessions=? 0,nusers=0,idletime=5114,totmem=652464944kb,availmem=649426768kb,physmem=644652560kb,ncpus=128,loadave=0.00,netload=19913039,state=free,jobs=774702.qserver,varattr=,rectime=1266866056
I am in fact trying to run a job on the host, but the maui checkjob
command reports this message:
Messages: cannot start job - RM failure, rc: 15082, msg: 'Premature
end of message'
I can't find error code 15082 in the maui manual. The job
goes into the running state but never actually starts running on the
host. At first the server keeps trying to rerun the job until it gets
started. Once the server thinks it is running, the job goes off into la
la land and I have to use qdel -p to remove it. So while I'm still
inclined to suspect the firewall, I thought I'd ask here to see what
others have seen. Perhaps I need to change some TCP timeouts or
something. On the client I'm able to run qmgr and qstat commands, so I
know I am close to getting this all working properly. Any ideas what
else I should look at? Let me know if you need clarification on
anything. Thanks,
-Steve