[torqueusers] Problem with canonical hostnames in mom_priv/nodes file

Michael Marti michael.marti at ist.utl.pt
Wed Apr 1 05:52:21 MDT 2009


I completely agree with you.
I guess the check on hp->h_addr_list[1] was meant as an optimization:
in certain cases it avoids a reverse lookup followed by a second
lookup starting from the resulting host name. In version 2.1.11, for
instance, this mechanism was implemented correctly (the IP count was
done outside the if clause). However:

- this shortcut only works when the initial name is canonical
- if the host has only one interface (one IP), a reverse lookup is
still done, even when the initial name was in fact canonical
- if the host is multi-homed but does not have a canonical name (the
first name of each interface is not identical), this procedure will
only get the IP addresses of the entry that contains the initial
name. This holds whether or not we do a reverse lookup.
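
To illustrate the counting point, here is a minimal sketch (not the actual node_func.c code) of counting the addresses in a hostent unconditionally, so the result does not depend on whether h_addr_list[1] is NULL. The helper names count_addrs and count_host_addrs are hypothetical and only serve the example.

```c
#include <stddef.h>
#include <netdb.h>

/* Count the entries of a NULL-terminated address list such as
 * hostent.h_addr_list. Hypothetical helper, not taken from TORQUE. */
size_t count_addrs(char **addr_list)
{
    size_t n = 0;

    if (addr_list == NULL)
        return 0;

    while (addr_list[n] != NULL)
        n++;

    return n;
}

/* Illustrative only: resolve a name and report how many addresses the
 * resolver returned, whether that is one or several.
 * Returns -1 if the lookup fails. */
long count_host_addrs(const char *name)
{
    struct hostent *hp = gethostbyname(name);

    if (hp == NULL)
        return -1;

    return (long)count_addrs(hp->h_addr_list);
}
```

With a count taken this way, a name that resolves to two addresses (like r1blade001 in the original report) and a name that resolves to one (like r1blade001m) go through the same path.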

So yes, I agree that it should be no problem to skip the predicate for
hp->h_addr_list[1]. I would just like to emphasize that either way, if
there is no canonical host name, torque will not get the full set of
IPs. This should not be a problem. After all, one can choose which
network torque should communicate on by selecting the corresponding
host names in the server_priv/nodes file. Of course, I am not aware of
the possible reasons torque might want to know any extra IPs.

Best regards,

P.S. I am sorry for not submitting a proper e-mail reply, but for
some reason the torqueusers mailing list does not send me any messages.

> Michael,
> I looked at the function process_host_name_part in node_func.c and I  
> do not understand why it even cares whether hp->h_addr_list[1] is  
> NULL or not. It appears the function is interested in finding an  
> array of addresses for objname along with an aliases without a :ts  
> in the name and its type.
>  if (hp->h_addr_list[1] == NULL)
>     {
>     /* weren't given canonical name */
> The comment seems to indicate if hp->h_addr_list[1] is NULL then we
> don't have a canonical name. Can anyone explain why
> hp->h_addr_list[1] has to be non-null to be canonical?
> Unless there is a compelling reason not to I think we should just  
> skip the predicate for hp->h_addr_list[1] and simply process the  
> address and name information received from gethostbyname().
> Ken Nielson
> Cluster Resources, Inc.
>> ----- Original Message -----
>> From: "Michael Marti" <michael.marti at ist.utl.pt>
>> To: torqueusers at supercluster.org
>> Sent: Monday, March 30, 2009 7:05:06 PM GMT -07:00 US/Canada Mountain
>> Subject: [torqueusers] Problem with canonical hostnames in mom_priv/nodes file
>> Dear All
>> We are using torque-2.3.6 on aix (AIX r1blade066 3 5 00003222D100)
>> On the head-node in /etc/hosts compute nodes have the following entry:
>>     r1blade001 r1blade001m r1blade001q      # Rack 1, BladeCenter1, blade 1
>>     r1blade001 r1blade001i   # Rack 1, BladeCenter1, blade 1
>> If we specify the nodes with the m suffix (as in r1blade001m) in the
>> file server_priv/nodes everything works. However if we specify the
>> host without suffix (as in r1blade001) pbs_server exits with the
>> following error:
>> PBS_Server: process_host_name_part, no valid IP addresses found for
>> 'r1blade001' - check name service
>> PBS_Server: pbsd_init(setup_nodes), could not create node
>> "r1blade001", error = 15010
>> PBS_Server: PBS_Server, pbsd_init failed
>> In the file src/server/node_func.c in the function
>> process_host_name_part() the host ipaddrs are not counted in case we
>> had more than one address on line 970. Essentially there should be
>> one more section counting the ip addresses after line 1126.
>> This is in agreement with the above symptom: if given r1blade001m
>> there will be only one IP on line 970. If given r1blade001 there will
>> be two IPs on line 970.
>> A quick and dirty fix could be to set the second IP to NULL just
>> before line 970 thus forcing the server always to assume a non
>> canonical name, for which the code is ok.
>> My line 969 of file src/server/node_func.c reads:
>> h_addr_list[1] = NULL;
>> This works for us.
>> A better solution of course would be to take the ip counting bit out
>> of the if clause on line 970.
>> Best regards,
>> Michael Marti

Michael Marti
Instituto Superior Técnico
Instituto de Plasmas e Fusão Nuclear
Complexo Interdisciplinar
Av. Rovisco Pais
1049-001 Lisboa

Tel:       +351 218 419 379
Fax:      +351 218 464 455
Mobile:  +351 968 434 327
