[torqueusers] mom log error "cannot connect to port 1023 in client_to_svr"

Steve Young chemadm at hamilton.edu
Mon Feb 22 12:26:37 MST 2010


Hi all,
	I'm running torque version 2.4.3, and I'm getting a strange error
message in the mom logs on one of my new Altix nodes:

02/22/2010 13:27:33;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
02/22/2010 13:27:34;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
02/22/2010 13:27:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
02/22/2010 13:27:36;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused

This continually repeats itself.
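
For anyone who wants to check how often the message repeats, here is a minimal sketch that splits mom log records like the ones above into fields and counts identical messages. The field names (timestamp, code, daemon and so on) are my own labels for the semicolon-separated columns, not official torque terminology, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Field labels are assumptions chosen for readability, not torque's own names.
FIELDS = ("timestamp", "code", "daemon", "kind", "name", "message")

def parse_mom_log_line(line):
    """Split one semicolon-delimited log record into a dict, or return None."""
    parts = line.strip().split(";", 5)  # the message itself may contain ';'
    if len(parts) != len(FIELDS):
        return None
    record = {key: value.strip() for key, value in zip(FIELDS, parts)}
    # Pull out the errno printed in parentheses, e.g. "(115)", if present.
    match = re.search(r"\((\d+)\)", record["message"])
    record["errno"] = int(match.group(1)) if match else None
    return record

def count_messages(lines):
    """Count how many times each distinct message text appears."""
    counts = Counter()
    for line in lines:
        record = parse_mom_log_line(line)
        if record is not None:
            counts[record["message"]] += 1
    return counts
```

On Linux, errno 115 is EINPROGRESS, which is what a non-blocking connect() reports while the connection attempt is still pending.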

Let me explain that this new client is in a remote data center and  
we've been having to dork around with firewall rules and such to make  
this work. I originally thought it was a firewall issue here too but  
it seems as if the connections are indeed going through when I sniff  
the interfaces on each host:

On the qserver I see:

[root at qserver server_logs]# tcpdump -i eth0 port 1023
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
12:50:02.812206 IP clienthost.1023 > qserver.15001: UDP, length 411
12:50:02.812423 IP qserver.15001 > clienthost.1023: UDP, length 26


and on the clienthost I also see:

clienthost:/var/spool/torque/mom_logs # tcpdump -i eth0 port 1023
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
12:50:07.356773 IP clienthost.1023 > qserver.15001: UDP, length 411
12:50:07.411752 IP qserver.15001 > clienthost.1023: UDP, length 26
12:50:52.974771 IP clienthost.1023 > qserver.15001: UDP, length 411
12:50:53.030257 IP qserver.15001 > clienthost.1023: UDP, length 26
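
Both captures show only UDP datagrams, while "connection refused" normally comes from a TCP connect, so a separate TCP check may be worth doing. Below is a minimal sketch of such a check; the function name is hypothetical, and using port 15001 (the port qserver answers from in the captures) for TCP as well is an assumption about this setup.

```python
import socket

def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on clienthost, `tcp_port_open("qserver", 15001)` would show whether a plain TCP connection gets through the firewall. Note that it connects from an unprivileged source port, so it would not catch a rule that filters on source port 1023.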


The server is also able to reach the client, as the pbsnodes command
shows:

[root at qserver torque]# pbsnodes -a clienthost
clienthost
      state = free
      np = 128
      properties = altix
      ntype = cluster
      jobs = 0/774702.qserver
      status = opsys=linux,uname=Linux clienthost 2.6.16.60-0.42.5-default #1 SMP Mon Aug 24 09:41:41 UTC 2009 ia64,sessions=? 0,nsessions=? 0,nusers=0,idletime=5114,totmem=652464944kb,availmem=649426768kb,physmem=644652560kb,ncpus=128,loadave=0.00,netload=19913039,state=free,jobs=774702.qserver,varattr=,rectime=1266866056

I am also trying to run a job on the host, and the maui checkjob
command reports this message for it:

Messages:  cannot start job - RM failure, rc: 15082, msg: 'Premature end of message'

I can't find error code 15082 in the maui manual. The job goes into
the running state but never actually starts running on the host. At
first the server keeps trying to rerun the job until it gets started.
Once the server thinks it is running, the job goes off into la la land
and I have to use qdel -p to remove it. So while I'm still inclined to
think something is up with the firewall, I thought I'd ask here to see
what others have seen with this. Perhaps I need to change some TCP
timeouts or something. On the client I'm able to run qmgr and qstat
commands, so I know I am close to getting this all working properly.
Any ideas what else I should look at? Let me know if you need
clarification on anything. Thanks,

-Steve





