[torqueusers] mom log error "cannot connect to port 1023 in client_to_svr"
Steve Young
chemadm at hamilton.edu
Tue Feb 23 08:48:13 MST 2010
Hi Ken,
Thanks for the insight. I checked both hosts with netstat and see this:
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp        0      0 *:1023                  *:*
That matches the inbound and outbound traffic I saw on the interfaces
when I sniffed them with tcpdump.
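The netstat output above lists only a UDP socket on 1023, while the error in the mom log comes from a connect() call, so a TCP-level check may help tell a refused connection apart from a filtered one. Below is a minimal sketch in Python. The function name probe_tcp is made up for illustration and is not part of TORQUE, and which host and port to test is an assumption to adapt to the setup.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a single TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    probe_tcp is an illustrative helper, not a TORQUE function.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except socket.timeout:
        # No answer at all: typical of a firewall silently dropping packets.
        return "timeout"
    except ConnectionRefusedError:
        # The remote host answered with a reset: nothing is listening on
        # that TCP port, or a firewall actively rejects the connection.
        return "refused"
    except OSError as exc:
        # Anything else (name resolution failure, unreachable network, ...).
        return "error: %s" % errno.errorcode.get(exc.errno, exc.errno)
    finally:
        sock.close()
```

For example, calling probe_tcp("qserver", 15001) on clienthost would test the server port that appears in the tcpdump output quoted below. A result of "refused" would match the ECONNREFUSED Ken describes, whereas "timeout" would point more toward packets being dropped along the way.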
On my server I also see this in the logs:
02/23/2010 10:41:03;0004;PBS_Server;Svr;is_request;IS_STATUS received from clienthost
So some communication is going through. Any other ideas? Thanks,
-Steve
Also, on a side note, I just noticed a problem with the mom_priv/config
options. I'm using torque 2.4.3 and tried the following statements in
the config file:
$size[fs=/scratch]
$arch ia64
$opsys SUSE10
That is the form the documentation page
( http://www.clusterresources.com/products/torque/docs/a.cmomconfig.shtml )
suggests, but pbs_mom complains with:
02/23/2010 09:40:05;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_config, special command name size not found (ignoring line)
02/23/2010 09:40:05;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_config, special command name arch not found (ignoring line)
02/23/2010 09:40:05;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_config, special command name opsys not found (ignoring line)
After I remove the $ in front of them, it works fine. In the config:
size[fs=/scratch]
arch ia64
opsys SUSE10
in logs:
02/23/2010 09:41:27;0080; pbs_mom;n/a;add_static;config[4] add name size value [fs=/scratch]
02/23/2010 09:41:27;0080; pbs_mom;n/a;add_static;config[5] add name arch value ia64
02/23/2010 09:41:27;0080; pbs_mom;n/a;add_static;config[6] add name opsys value SUSE10
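The names pbs_mom rejected can be pulled out of a mom log mechanically. Below is a minimal sketch in Python, assuming the message text is exactly what the log excerpt above shows. The function name rejected_directives is made up for illustration and is not part of TORQUE.

```python
import re

# Matches the pbs_mom message quoted above, for example:
#   ...LOG_ERROR::read_config, special command name size not found (ignoring line)
# \s+ also tolerates a line wrap between the name and "not found".
REJECTED = re.compile(r"read_config, special command name (\S+)\s+not found")


def rejected_directives(log_text):
    """Return the directive names reported as unknown, in order, without duplicates."""
    seen = []
    for name in REJECTED.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen
```

Fed the three error lines above, this returns size, arch and opsys, which are the entries that were accepted once the leading $ was removed.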
On Feb 22, 2010, at 4:08 PM, Ken Nielson wrote:
> Steve,
>
> This error is coming from the function client_to_svr after a call to
> connect. Connect is failing with an ECONNREFUSED. You might want to
> try netstat and see if 1023 is even available.
>
> Ken Nielson
> Adaptive Computing
>
> Steve Young wrote:
>> Hi all,
>> I'm running torque version 2.4.3. I seem to be getting a strange
>> error message in the mom client logs on one of my new Altix nodes:
>>
>> 02/22/2010 13:27:33;0001;
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in
>> scan_for_exiting, cannot connect to port 1023 in client_to_svr -
>> connection refused
>> 02/22/2010 13:27:34;0001;
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in
>> scan_for_exiting, cannot connect to port 1023 in client_to_svr -
>> connection refused
>> 02/22/2010 13:27:35;0001;
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in
>> scan_for_exiting, cannot connect to port 1023 in client_to_svr -
>> connection refused
>> 02/22/2010 13:27:36;0001;
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in
>> scan_for_exiting, cannot connect to port 1023 in client_to_svr -
>> connection refused
>>
>> This continually repeats itself.
>>
>> Let me explain that this new client is in a remote data center and
>> we've been having to dork around with firewall rules and such to
>> make this work. I originally thought it was a firewall issue here
>> too but it seems as if the connections are indeed going through
>> when I sniff the interfaces on each host:
>>
>> On the qserver I see:
>>
>> [root@qserver server_logs]# tcpdump -i eth0 port 1023
>> tcpdump: verbose output suppressed, use -v or -vv for full
>> protocol decode
>> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
>> 12:50:02.812206 IP clienthost.1023 > qserver.15001: UDP, length 411
>> 12:50:02.812423 IP qserver.15001 > clienthost.1023: UDP, length 26
>>
>>
>> and on the clienthost I also see:
>>
>> clienthost:/var/spool/torque/mom_logs # tcpdump -i eth0 port 1023
>> tcpdump: verbose output suppressed, use -v or -vv for full
>> protocol decode
>> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
>> 12:50:07.356773 IP clienthost.1023 > qserver.15001: UDP, length 411
>> 12:50:07.411752 IP qserver.15001 > clienthost.1023: UDP, length 26
>> 12:50:52.974771 IP clienthost.1023 > qserver.15001: UDP, length 411
>> 12:50:53.030257 IP qserver.15001 > clienthost.1023: UDP, length 26
>>
>>
>> I also am seeing the server being able to connect to the client
>> with pbsnodes command:
>>
>> [root@qserver torque]# pbsnodes -a clienthost
>> clienthost
>> state = free
>> np = 128
>> properties = altix
>> ntype = cluster
>> jobs = 0/774702.qserver
>> status = opsys=linux,uname=Linux clienthost 2.6.16.60-0.42.5-
>> default #1 SMP Mon Aug 24 09:41:41 UTC 2009 ia64,sessions=?
>> 0,nsessions=? 0 ,nusers = 0 ,idletime = 5114 ,totmem =
>> 652464944kb ,availmem = 649426768kb ,physmem = 644652560kb ,ncpus =
>> 128 ,loadave = 0.00 ,netload
>> =19913039,state=free,jobs=774702.qserver,varattr=,rectime=1266866056
>>
>> In fact I am trying to run a job on the host but I also get this
>> message in the maui checkjob command:
>>
>> Messages: cannot start job - RM failure, rc: 15082, msg:
>> 'Premature end of message'
>>
>> I don't see what error code 15082 means in the maui manual. The
>> job goes into the running state but never actually starts running
>> on the host. At first the server keeps trying to rerun the job
>> until it gets started. Once the server thinks it is running the
>> job goes off into la la land and I have to use a qdel -p to remove
>> it. So while I still might be inclined to think something was up
>> with the firewall I thought I might ask here to see what others
>> have seen with this. Perhaps I need to change some tcp timeouts or
>> something. On the client I'm able to run qmgr and run qstat
>> commands so I know I am close to getting this all working
>> properly. Any ideas what else I should look at? Let me know if
>> you need better clarification on anything. Thanks,
>>
>> -Steve