[torqueusers] Problem with canonical hostnames in mom_priv/nodes file

Ken Nielson knielson at clusterresources.com
Tue Mar 31 11:27:20 MDT 2009


I looked at the function process_host_name_part in node_func.c and I do not understand why it even cares whether hp->h_addr_list[1] is NULL or not. It appears the function is interested in finding an array of addresses for objname, along with an alias without a :ts in the name, and its type.

 if (hp->h_addr_list[1] == NULL)
    /* weren't given canonical name */

The comment seems to indicate if hp->h_addr_list[1] is NULL then we don't have a canonical name. Can anyone explain why hp->h_addr_list[1] has to be non-null to be canonical?

Unless there is a compelling reason not to, I think we should just drop the test on hp->h_addr_list[1] and simply process the address and name information returned by gethostbyname().

Ken Nielson
Cluster Resources, Inc.

----- Original Message -----
From: "Michael Marti" <michael.marti at ist.utl.pt>
To: torqueusers at supercluster.org
Sent: Monday, March 30, 2009 7:05:06 PM GMT -07:00 US/Canada Mountain
Subject: [torqueusers] Problem with canonical hostnames in mom_priv/nodes file

Dear All

We are using torque-2.3.6 on AIX (AIX r1blade066 3 5 00003222D100).

On the head node, the compute nodes have the following entries in /etc/hosts:

    r1blade001 r1blade001m r1blade001q   # Rack 1, BladeCenter1, blade 1
    r1blade001 r1blade001i               # Rack 1, BladeCenter1, blade 1

If we specify the nodes with the m suffix (as in r1blade001m) in the  
file server_priv/nodes everything works. However if we specify the  
host without suffix (as in r1blade001) pbs_server exits with the  
following error:

PBS_Server: process_host_name_part, no valid IP addresses found for 'r1blade001' - check name service
PBS_Server: pbsd_init(setup_nodes), could not create node "r1blade001", error = 15010
PBS_Server: PBS_Server, pbsd_init failed

In the file src/server/node_func.c, in the function
process_host_name_part(), the host IP addresses are not counted when
the check on line 970 finds more than one address. Essentially there
should be one more section counting the IP addresses after line 1126.
This agrees with the symptom above: given r1blade001m, there is only
one IP on line 970; given r1blade001, there are two.
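
To see the difference between the two names on a given machine, something like the following can help. It is only a sketch: describe_host is a hypothetical helper, not a TORQUE function, and it assumes IPv4 entries. It prints the official name, the aliases, and every address in a struct hostent, and returns the number of addresses found.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of TORQUE): print what a lookup returned.
 * Returns the number of IPv4 addresses in hp, 0 for a non-IPv4 entry,
 * or -1 if the lookup itself failed (hp is NULL). */
static int describe_host(const struct hostent *hp, FILE *out)
{
    char text[INET_ADDRSTRLEN];
    int i;
    int naddrs = 0;

    assert(out != NULL);

    if (hp == NULL) {
        fprintf(out, "lookup failed\n");
        return -1;
    }

    fprintf(out, "official name: %s\n", hp->h_name ? hp->h_name : "(none)");

    for (i = 0; hp->h_aliases != NULL && hp->h_aliases[i] != NULL; i++)
        fprintf(out, "alias: %s\n", hp->h_aliases[i]);

    if (hp->h_addrtype != AF_INET)
        return 0;

    /* Walk the whole NULL-terminated list, not just entries 0 and 1. */
    for (i = 0; hp->h_addr_list != NULL && hp->h_addr_list[i] != NULL; i++) {
        if (inet_ntop(AF_INET, hp->h_addr_list[i], text, sizeof text) != NULL)
            fprintf(out, "address: %s\n", text);
        naddrs++;
    }

    return naddrs;
}
```

Calling describe_host(gethostbyname("r1blade001"), stdout) and then the same with "r1blade001m" on the head node should show whether the first name really resolves to two addresses and the second to one, as described above.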

A quick and dirty fix could be to set the second IP to NULL just  
before line 970 thus forcing the server always to assume a non  
canonical name, for which the code is ok.
My line 969 of file src/server/node_func.c reads:
hp->h_addr_list[1] = NULL;

This works for us.

A better solution of course would be to take the ip counting bit out  
of the if clause on line 970.
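
As a rough illustration of what counting outside the if clause could look like, here is a minimal sketch. The names count_host_addrs and copy_host_addrs are made up for this example and are not the actual code in node_func.c; the sketch also assumes IPv4 entries only. The point is that the address list is walked the same way whether it holds one entry or several.

```c
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: count every entry in the NULL-terminated
 * hp->h_addr_list, with no special case for h_addr_list[1]. */
static size_t count_host_addrs(const struct hostent *hp)
{
    size_t n = 0;

    assert(hp != NULL);

    if (hp->h_addr_list == NULL)
        return 0;

    while (hp->h_addr_list[n] != NULL)
        n++;

    return n;
}

/* Hypothetical helper: copy all IPv4 addresses into a malloc'd array.
 * Stores the number of addresses in *count. Returns NULL (and sets
 * *count to 0) if there are no addresses, the entry is not IPv4, or
 * allocation fails. The caller frees the result. */
static struct in_addr *copy_host_addrs(const struct hostent *hp, size_t *count)
{
    size_t n = count_host_addrs(hp);
    struct in_addr *out;
    size_t i;

    *count = 0;

    if (n == 0 || hp->h_addrtype != AF_INET ||
        hp->h_length != (int)sizeof(struct in_addr))
        return NULL;

    out = malloc(n * sizeof *out);
    if (out == NULL)
        return NULL;

    for (i = 0; i < n; i++)
        memcpy(&out[i], hp->h_addr_list[i], sizeof out[i]);

    *count = n;
    return out;
}
```

With this shape, a name that resolves to two addresses is handled by the same path as a name that resolves to one, so the "no valid IP addresses found" case would only arise when the list is actually empty.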

Best regards,
Michael Marti

Michael Marti
Instituto Superior Técnico
Instituto de Plasmas e Fusão Nuclear
Complexo Interdisciplinar
Av. Rovisco Pais
1049-001 Lisboa

Tel:       +351 218 419 379
Fax:      +351 218 464 455
Mobile:  +351 968 434 327
