[torqueusers] mom log error "cannot connect to port 1023 in client_to_svr"

Ken Nielson knielson at adaptivecomputing.com
Mon Feb 22 14:08:58 MST 2010


Steve,

This error comes from the function client_to_svr, after its call to
connect() fails with ECONNREFUSED. You might want to run netstat and
check whether port 1023 is even available.
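
As a complement to netstat, a plain TCP connect from the mom host shows
whether the server end refuses the connection. The sketch below is only
an illustration: probe_tcp is a hypothetical helper, not part of TORQUE,
and it assumes the target is the server's TCP port 15001 (the port that
appears in your tcpdump output, where only UDP traffic is shown).

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect; return 'OK' or the symbolic errno name."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns an errno value instead of raising an
        # exception for connection-level failures.
        code = sock.connect_ex((host, port))
    if code == 0:
        return "OK"
    # Map the numeric code to its name, e.g. ECONNREFUSED.
    return errno.errorcode.get(code, str(code))
```

Calling probe_tcp("qserver", 15001) on clienthost and getting
ECONNREFUSED would suggest that something on the path actively rejects
the TCP connection, while a timeout would point more towards packets
being dropped silently.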

Ken Nielson
Adaptive Computing

Steve Young wrote:
> Hi all,
> 	I'm running torque version 2.4.3.  I seem to be getting a strange  
> error message in the mom client logs on one of my new Altix nodes:
>
> 02/22/2010 13:27:33;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
> 02/22/2010 13:27:34;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
> 02/22/2010 13:27:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
> 02/22/2010 13:27:36;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
>
> This continually repeats itself.
>
> Let me explain that this new client is in a remote data center and  
> we've been having to dork around with firewall rules and such to make  
> this work. I originally thought it was a firewall issue here too but  
> it seems as if the connections are indeed going through when I sniff  
> the interfaces on each host:
>
> On the qserver I see:
>
> [root@qserver server_logs]# tcpdump -i eth0 port 1023
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
> 12:50:02.812206 IP clienthost.1023 > qserver.15001: UDP, length 411
> 12:50:02.812423 IP qserver.15001 > clienthost.1023: UDP, length 26
>
>
> and on the clienthost I also see:
>
> clienthost:/var/spool/torque/mom_logs # tcpdump -i eth0 port 1023
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
> 12:50:07.356773 IP clienthost.1023 > qserver.15001: UDP, length 411
> 12:50:07.411752 IP qserver.15001 > clienthost.1023: UDP, length 26
> 12:50:52.974771 IP clienthost.1023 > qserver.15001: UDP, length 411
> 12:50:53.030257 IP qserver.15001 > clienthost.1023: UDP, length 26
>
>
> The server is also able to reach the client with the pbsnodes
> command:
>
> [root@qserver torque]# pbsnodes -a clienthost
> clienthost
>       state = free
>       np = 128
>       properties = altix
>       ntype = cluster
>       jobs = 0/774702.qserver
>       status = opsys=linux,uname=Linux clienthost 2.6.16.60-0.42.5-default #1 SMP Mon Aug 24 09:41:41 UTC 2009 ia64,sessions=? 0,nsessions=? 0,nusers=0,idletime=5114,totmem=652464944kb,availmem=649426768kb,physmem=644652560kb,ncpus=128,loadave=0.00,netload=19913039,state=free,jobs=774702.qserver,varattr=,rectime=1266866056
>
> In fact, I am trying to run a job on the host, but I get this
> message from the maui checkjob command:
>
> Messages:  cannot start job - RM failure, rc: 15082, msg: 'Premature  
> end of message'
>
> I can't find error code 15082 in the maui manual. The job goes into
> the running state but never actually starts running on the host. At
> first the server keeps trying to rerun the job until it gets started.
> Once the server thinks it is running, the job goes off into la la land
> and I have to use qdel -p to remove it. So while I'm still inclined to
> suspect the firewall, I thought I'd ask here to see what others have
> seen. Perhaps I need to change some TCP timeouts or something. On the
> client I'm able to run qmgr and qstat commands, so I know I am close
> to getting this all working properly. Any ideas what else I should
> look at? Let me know if you need clarification on anything. Thanks,
>
> -Steve
>


