[torqueusers] mom log error "cannot connect to port 1023 in client_to_svr"

Steve Young chemadm at hamilton.edu
Tue Feb 23 08:48:13 MST 2010


Hi Ken,
	Thanks for the insight. I checked both hosts with netstat and see this:

Proto Recv-Q Send-Q Local Address               Foreign Address             State
udp        0      0 *:1023                      *:*

That is consistent with the in/out traffic I saw on the interfaces
when I sniffed them with tcpdump.
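
One caveat worth checking: the netstat and tcpdump output above only shows UDP, while a "connection refused" from connect() normally points at a TCP connection, so UDP getting through does not by itself show that TCP passes the firewall. A minimal sketch of a TCP probe follows, assuming Python is available on the node. The function name probe_tcp and its return labels are made up for illustration, and port 15001 is simply the server port seen in the tcpdump traces.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connect and return a short label for the outcome.

    Hypothetical helper for illustration; not part of TORQUE.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns an errno value instead of raising for
        # ordinary connect failures.
        err = sock.connect_ex((host, port))
    except OSError as exc:
        # Name-resolution failures and similar still raise.
        return "error: %s" % exc
    finally:
        sock.close()

    if err == 0:
        return "open"
    if err == errno.ECONNREFUSED:
        # Something answered with a reset: no listener, or a firewall
        # configured to reject rather than drop.
        return "refused"
    if err in (errno.ETIMEDOUT, errno.EAGAIN, errno.EWOULDBLOCK):
        # No answer at all, which often means packets are being dropped.
        return "timeout"
    return "error: %s" % errno.errorcode.get(err, err)
```

Running something like print(probe_tcp("qserver", 15001)) from the client, and the reverse direction from the server, would show whether TCP is refused, silently dropped, or accepted. It does not reproduce whatever source port pbs_mom itself binds to, so treat the result as a hint only.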

On my server I also see this in the logs:

02/23/2010 10:41:03;0004;PBS_Server;Svr;is_request;IS_STATUS received from clienthost

So some communication is going through. Any other ideas? Thanks,

-Steve




On a side note, I just noticed a problem with the mom_priv/config
options. I'm using torque 2.4.3, and when I tried the following
statements in the config file:

$size[fs=/scratch]
$arch ia64
$opsys SUSE10

as the documentation page
(http://www.clusterresources.com/products/torque/docs/a.cmomconfig.shtml)
suggests, pbs_mom complains with:

02/23/2010 09:40:05;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::read_config, special command name size not found (ignoring line)
02/23/2010 09:40:05;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::read_config, special command name arch not found (ignoring line)
02/23/2010 09:40:05;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::read_config, special command name opsys not found (ignoring line)

After I remove the $ in front of them, it works fine. In the config:

size[fs=/scratch]
arch ia64
opsys SUSE10

in logs:

02/23/2010 09:41:27;0080;   pbs_mom;n/a;add_static;config[4] add name size value [fs=/scratch]
02/23/2010 09:41:27;0080;   pbs_mom;n/a;add_static;config[5] add name arch value ia64
02/23/2010 09:41:27;0080;   pbs_mom;n/a;add_static;config[6] add name opsys value SUSE10
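
The two sets of log lines suggest that pbs_mom looks up $-prefixed names in a table of "special commands", while names without the $ go through add_static as plain name/value pairs. A rough sketch of that split follows, written only from the behaviour in the logs above. The function classify_mom_config is hypothetical, the real pbs_mom parser may well differ, and treating lines starting with # as comments is an assumption.

```python
import re

# Optional leading "$", then a name, then whatever follows as the value.
_LINE = re.compile(r"^(\$?)([A-Za-z_][A-Za-z0-9_]*)\s*(.*)$")


def classify_mom_config(text):
    """Split config text into (kind, name, value) tuples.

    kind is "directive" for $-prefixed names (looked up as special
    commands, per the read_config errors) and "static" for the rest
    (stored as name/value pairs, per the add_static log lines).
    Illustrative only; not the actual pbs_mom parser.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and "#" comments are skipped.
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        dollar, name, value = match.groups()
        kind = "directive" if dollar else "static"
        entries.append((kind, name, value.strip()))
    return entries
```

Fed the three lines from the config above, this reports size, arch and opsys as "directive" when written with the $ and as "static" without it, which matches the pairing of "special command name ... not found" with the first form and "add name ... value ..." with the second.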



On Feb 22, 2010, at 4:08 PM, Ken Nielson wrote:

> Steve,
>
> This error is coming from the function client_to_svr after a call to  
> connect. Connect is failing with an ECONNREFUSED. You might want to  
> try netstat and see if 1023 is even available.
>
> Ken Nielson
> Adaptive Computing
>
> Steve Young wrote:
>> Hi all,
>> 	I'm running torque version 2.4.3.  I seem to be getting a strange   
>> error message in the mom client logs on one of my new Altix nodes:
>>
>> 02/22/2010 13:27:33;0001;    
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation  now in progress (115) in  
>> scan_for_exiting, cannot connect to port 1023  in client_to_svr -  
>> connection refused
>> 02/22/2010 13:27:34;0001;    
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation  now in progress (115) in  
>> scan_for_exiting, cannot connect to port 1023  in client_to_svr -  
>> connection refused
>> 02/22/2010 13:27:35;0001;    
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation  now in progress (115) in  
>> scan_for_exiting, cannot connect to port 1023  in client_to_svr -  
>> connection refused
>> 02/22/2010 13:27:36;0001;    
>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation  now in progress (115) in  
>> scan_for_exiting, cannot connect to port 1023  in client_to_svr -  
>> connection refused
>>
>> This continually repeats itself.
>>
>> Let me explain that this new client is in a remote data center and   
>> we've been having to dork around with firewall rules and such to  
>> make  this work. I originally thought it was a firewall issue here  
>> too but  it seems as if the connections are indeed going through  
>> when I sniff  the interfaces on each host:
>>
>> On the qserver I see:
>>
>> [root@qserver server_logs]# tcpdump -i eth0 port 1023
>> tcpdump: verbose output suppressed, use -v or -vv for full  
>> protocol  decode
>> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
>> 12:50:02.812206 IP clienthost.1023 > qserver.15001: UDP, length 411
>> 12:50:02.812423 IP qserver.15001 > clienthost.1023: UDP, length 26
>>
>>
>> and on the clienthost I also see:
>>
>> clienthost:/var/spool/torque/mom_logs # tcpdump -i eth0 port 1023
>> tcpdump: verbose output suppressed, use -v or -vv for full  
>> protocol  decode
>> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
>> 12:50:07.356773 IP clienthost.1023 > qserver.15001: UDP, length 411
>> 12:50:07.411752 IP qserver.15001 > clienthost.1023: UDP, length 26
>> 12:50:52.974771 IP clienthost.1023 > qserver.15001: UDP, length 411
>> 12:50:53.030257 IP qserver.15001 > clienthost.1023: UDP, length 26
>>
>>
>> I also am seeing the server being able to connect to the client  
>> with  pbsnodes command:
>>
>> [root@qserver torque]# pbsnodes -a clienthost
>> clienthost
>>      state = free
>>      np = 128
>>      properties = altix
>>      ntype = cluster
>>      jobs = 0/774702.qserver
>>      status = opsys=linux,uname=Linux clienthost 2.6.16.60-0.42.5-  
>> default #1 SMP Mon Aug 24 09:41:41 UTC 2009 ia64,sessions=?   
>> 0,nsessions=?  0 ,nusers = 0 ,idletime = 5114 ,totmem =  
>> 652464944kb ,availmem = 649426768kb ,physmem = 644652560kb ,ncpus =  
>> 128 ,loadave = 0.00 ,netload  
>> =19913039,state=free,jobs=774702.qserver,varattr=,rectime=1266866056
>>
>> In fact I am trying to run a job on the host but I also get this   
>> message in the maui checkjob command:
>>
>> Messages:  cannot start job - RM failure, rc: 15082, msg:  
>> 'Premature  end of message'
>>
>> I don't see what an error code 15082 is in the maui manual. The  
>> job  goes into the running state but never actually starts running  
>> on the  host. At first the server keeps trying to rerun the job  
>> until it gets  started. Once the server thinks it is running the  
>> job goes off into la  la land and I have to use a qdel -p to remove  
>> it.  So while I still  might be inclined to think something was up  
>> with the firewall I  thought I might ask here to see what others  
>> have seen with this.  Perhaps I need to change some tcp timeouts or  
>> something. On the client  I'm able to run qmgr and run qstat  
>> commands so I know I am close to  getting this all working  
>> properly. Any idea's what else I should look  at? Let me know if  
>> you need better clarification on anything. Thanks,
>>
>> -Steve
>>
>>
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>


