[torqueusers] mom log error "cannot connect to port 1023 in client_to_svr"
Ken Nielson
knielson at adaptivecomputing.com
Mon Feb 22 14:08:58 MST 2010
Steve,
This error comes from the function client_to_svr, right after its call to
connect(). The connect() call is failing with ECONNREFUSED. You might want
to run netstat and check whether port 1023 is even available.
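
If it helps to tell a refused connection apart from a silently dropped one, here is a minimal Python sketch. It is not part of TORQUE: probe_tcp is a hypothetical helper, and the host and port in the usage comment (qserver, 15001, as seen in the tcpdump output quoted below) are only an assumption about where pbs_mom is trying to connect. A "refused" result means something answered with a reset, so the host was reachable but nothing accepted the connection on that port (or a firewall actively rejected it). A "timeout" result more often points at a firewall silently dropping packets.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome as a string."""
    try:
        # create_connection resolves the host and performs a TCP connect()
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # connect() failed with ECONNREFUSED
        return "refused"
    except socket.timeout:
        # no answer within the timeout, e.g. packets silently dropped
        return "timeout"
    except OSError as exc:
        # anything else: report the symbolic errno name when one is known
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


# Hypothetical usage, run from the MOM host:
#   print(probe_tcp("qserver", 15001))
```

Note that this probe connects from an ordinary unprivileged local port, so it only tests reachability of the remote port, not whatever local port pbs_mom binds.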
Ken Nielson
Adaptive Computing
Steve Young wrote:
> Hi all,
> I'm running torque version 2.4.3. I seem to be getting a strange
> error message in the mom client logs on one of my new Altix nodes:
>
> 02/22/2010 13:27:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
> 02/22/2010 13:27:34;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
> 02/22/2010 13:27:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
> 02/22/2010 13:27:36;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
>
> This continually repeats itself.
>
> Let me explain that this new client is in a remote data center and
> we've been having to dork around with firewall rules and such to make
> this work. I originally thought it was a firewall issue here too but
> it seems as if the connections are indeed going through when I sniff
> the interfaces on each host:
>
> On the qserver I see:
>
> [root@qserver server_logs]# tcpdump -i eth0 port 1023
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
> 12:50:02.812206 IP clienthost.1023 > qserver.15001: UDP, length 411
> 12:50:02.812423 IP qserver.15001 > clienthost.1023: UDP, length 26
>
>
> and on the clienthost I also see:
>
> clienthost:/var/spool/torque/mom_logs # tcpdump -i eth0 port 1023
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
> 12:50:07.356773 IP clienthost.1023 > qserver.15001: UDP, length 411
> 12:50:07.411752 IP qserver.15001 > clienthost.1023: UDP, length 26
> 12:50:52.974771 IP clienthost.1023 > qserver.15001: UDP, length 411
> 12:50:53.030257 IP qserver.15001 > clienthost.1023: UDP, length 26
>
>
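
One thing worth noticing in both captures: the filter "port 1023" matches TCP and UDP alike, yet every packet shown is labelled UDP. Assuming client_to_svr makes a TCP connection (an assumption, not something the captures confirm), these traces show that UDP traffic gets through but say nothing about the connect() that is being refused. The Python sketch below tallies captured lines per flow and protocol to make that visible. The names LINE_RE and summarize are hypothetical, and the regular expression only covers the simple line format shown above; anything not labelled UDP is lumped together as "other".

```python
import re
from collections import Counter

# Matches lines such as:
#   12:50:02.812206 IP clienthost.1023 > qserver.15001: UDP, length 411
LINE_RE = re.compile(
    r"^\S+ IP (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): (?P<rest>.*)$"
)


def summarize(lines):
    """Count packets per (src, sport, dst, dport, protocol) tuple."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # banner lines and anything else we do not understand
        proto = "UDP" if match.group("rest").startswith("UDP") else "other"
        key = (
            match.group("src"),
            int(match.group("sport")),
            match.group("dst"),
            int(match.group("dport")),
            proto,
        )
        counts[key] += 1
    return counts


# The four packets from the clienthost capture above
sample = [
    "12:50:07.356773 IP clienthost.1023 > qserver.15001: UDP, length 411",
    "12:50:07.411752 IP qserver.15001 > clienthost.1023: UDP, length 26",
    "12:50:52.974771 IP clienthost.1023 > qserver.15001: UDP, length 411",
    "12:50:53.030257 IP qserver.15001 > clienthost.1023: UDP, length 26",
]

for flow, count in sorted(summarize(sample).items()):
    print(flow, count)
```

If no flow other than UDP ever appears while the error is being logged, a capture with a wider filter (for example on the server host rather than on port 1023) would be the next thing to try.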
> I also see that the server is able to connect to the client with the
> pbsnodes command:
>
> [root@qserver torque]# pbsnodes -a clienthost
> clienthost
>      state = free
>      np = 128
>      properties = altix
>      ntype = cluster
>      jobs = 0/774702.qserver
>      status = opsys=linux,uname=Linux clienthost 2.6.16.60-0.42.5-default #1 SMP Mon Aug 24 09:41:41 UTC 2009 ia64,sessions=? 0,nsessions=? 0,nusers=0,idletime=5114,totmem=652464944kb,availmem=649426768kb,physmem=644652560kb,ncpus=128,loadave=0.00,netload=19913039,state=free,jobs=774702.qserver,varattr=,rectime=1266866056
>
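
The status attribute above is a comma-separated list of key=value pairs, which is awkward to read by eye. A minimal Python sketch for splitting it up follows. parse_status and received_at are hypothetical helpers, the sample string is a shortened copy of the output above, and treating rectime as seconds since the Unix epoch is an assumption. The split is naive and would break on a value that itself contains a comma.

```python
from datetime import datetime, timezone


def parse_status(status):
    """Split a pbsnodes status string into a dict of key -> value strings."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that contain no '=' at all
            fields[key.strip()] = value.strip()
    return fields


def received_at(fields):
    """Convert rectime to a UTC datetime (assumes Unix epoch seconds)."""
    return datetime.fromtimestamp(int(fields["rectime"]), tz=timezone.utc)


# Shortened copy of the status line quoted above
sample_status = (
    "opsys=linux,sessions=? 0,nsessions=? 0,nusers=0,idletime=5114,"
    "ncpus=128,loadave=0.00,netload=19913039,state=free,"
    "jobs=774702.qserver,varattr=,rectime=1266866056"
)

fields = parse_status(sample_status)
print(fields["state"], fields["jobs"], received_at(fields).isoformat())
```

Comparing the converted rectime against the current time gives a quick check of whether the server is still receiving fresh status updates from the node.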
> In fact I am trying to run a job on the host, but I also get this
> message from the maui checkjob command:
>
> Messages: cannot start job - RM failure, rc: 15082, msg: 'Premature end of message'
>
> I can't find error code 15082 in the maui manual. The job
> goes into the running state but never actually starts running on the
> host. At first the server keeps trying to rerun the job until it gets
> started. Once the server thinks it is running, the job goes off into la
> la land and I have to use qdel -p to remove it. So while I'm still
> inclined to think something is up with the firewall, I
> thought I'd ask here to see what others have seen with this.
> Perhaps I need to change some TCP timeouts or something. On the client
> I'm able to run qmgr and qstat commands, so I know I am close to
> getting this all working properly. Any ideas what else I should look
> at? Let me know if you need clarification on anything. Thanks,
>
> -Steve
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>