[torqueusers] Hostname mismatches in Torque

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Tue Sep 6 07:08:58 MDT 2005


Torque seems to name nodes inconsistently when the compute nodes
sit on an internal IP network. Here is an example:

The PBS server node has hostname "torn", an external IP number named
"torn" and an internal IP number named "n0".

A login node has hostname "tornado", an external IP number named
"tornado" and an internal IP number named "l1".

The compute nodes each have one internal IP number, named from the
series n1, n2, n3, ..., and a hostname identical to that name. They
have no external IP numbers.

From the login node I run

	[lennart@tornado ~]% qsub -I -l nodes=4:ppn=2,walltime=1:00:00
	qsub: waiting for job 4361.torn to start
	qsub: job 4361.torn ready

	[lennart@n99 ~]%

On the PBS server node I run

	[lennart@torn ~]% qstat -f 4361.torn
	Job Id: 4361.torn
	    Job_Name = STDIN
	    Job_Owner = lennart@l1
	    resources_used.cput = 00:00:00
	    resources_used.mem = 4784kb
	    resources_used.vmem = 66620kb
	    resources_used.walltime = 00:00:40
	    job_state = R
	    queue = workq
	    server = torn
	    Checkpoint = u
	    ctime = Tue Sep  6 13:12:41 2005
	    Error_Path = tornado:/home/lennart/STDIN.e4361
	    exec_host = n99/1+n99/0+n98/1+n98/0+n97/1+n97/0+n96/1+n96/0
	    Hold_Types = n
	    interactive = True
	    Join_Path = n
	    Keep_Files = n
	    Mail_Points = a
	    mtime = Tue Sep  6 13:13:24 2005
	    Output_Path = tornado:/home/lennart/STDIN.o4361
	    Priority = 0
	    qtime = Tue Sep  6 13:12:41 2005
	    Rerunable = True
	    Resource_List.neednodes = 4:ppn=2
	    Resource_List.nodect = 4
	    Resource_List.nodes = 4:ppn=2
	    Resource_List.walltime = 01:00:00
	    substate = 42
	    Variable_List = PBS_O_HOME=/home/lennart,PBS_O_LANG=en_US.UTF-8,
	        PBS_O_LOGNAME=lennart,
	        PBS_O_PATH=/usr/kerberos/bin:/usr/pbs/bin:/usr/local/bin:/bin:/usr/bin:
	        /usr/local/intel/8.1/l_fce_pc_8.1.029/bin:
	        /usr/local/intel/8.1/l_cce_pc_8.1.029/bin:
	        /usr/local/intel/intel_idbe_80/bin:/opt/scali/bin:/opt/scali/sbin:
	        /opt/scali/contrib/pbs/bin:/opt/scali/contrib/torque/bin:/usr/X11R6/bin,
	        PBS_O_MAIL=/var/spool/mail/lennart,PBS_O_SHELL=/bin/tcsh,
	        PBS_O_HOST=tornado,PBS_O_WORKDIR=/home/lennart,PBS_O_QUEUE=workq
	    euser = lennart
	    egroup = nsc
	    hashname = 4361.torn
	    queue_rank = 979
	    queue_type = E
	    etime = Tue Sep  6 13:12:41 2005

	[lennart@torn ~]%

Communication between the login node, the PBS server node and the compute
nodes runs entirely over the internal IP network, so I appreciate
that the "Job_Owner" field actually mentions the internal host name "l1".

But all the other host references in the job data seem to be set to the
external name "tornado": Error_Path, Output_Path, and PBS_O_HOST. I have
also noted that the mom_superior (first node in the job) tries to make
a "qsub sock" connection to the external IP interface of the login node.
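
To make the mismatch concrete, here is a minimal Python sketch that pulls the
host part out of a few fields of the qstat -f output quoted above and reports
the ones that differ from the host in Job_Owner. The function name
host_mismatches and the trimmed sample text are illustrative assumptions, not
part of Torque; the pattern accepts both "@" and the " at " spelling that mail
archives substitute for it.

```python
import re

# Trimmed copy of the qstat -f fields quoted above (sample data only).
SAMPLE = """\
Job_Owner = lennart@l1
Error_Path = tornado:/home/lennart/STDIN.e4361
Output_Path = tornado:/home/lennart/STDIN.o4361
PBS_O_HOST=tornado,PBS_O_WORKDIR=/home/lennart
"""


def host_mismatches(qstat_text):
    """Return {field: host} for host references that differ from Job_Owner's host.

    Hypothetical helper for illustration; it only looks at the three
    fields discussed in this post.
    """
    owner = re.search(r"Job_Owner = \S+(?:@| at )(\S+)", qstat_text)
    if owner is None:
        raise ValueError("no Job_Owner line found")
    owner_host = owner.group(1)

    found = {}
    # Error_Path and Output_Path have the form host:/path
    for field in ("Error_Path", "Output_Path"):
        m = re.search(rf"{field} = ([^:\s]+):", qstat_text)
        if m:
            found[field] = m.group(1)
    # PBS_O_HOST sits inside the comma-separated Variable_List
    m = re.search(r"PBS_O_HOST=([^,\s]+)", qstat_text)
    if m:
        found["PBS_O_HOST"] = m.group(1)

    return {name: host for name, host in found.items() if host != owner_host}


if __name__ == "__main__":
    print(host_mismatches(SAMPLE))
```

For the job shown above, all three fields come back with "tornado" while
Job_Owner carries "l1".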

It would be much better if all these host references pointed to the
internal IP addresses, i.e. if the host name in the "Job_Owner" field
were also used in those other places, because these host addresses
will be used on the compute nodes. (Trying to reach the external IP
addresses from there will probably fail, due to routing problems and/or
firewalls.)

I would like this change to Torque, please.

Can this be made the default behavior, without wreaking havoc
on other, existing installations?

The second-best alternative would be to let the pbs_server configuration
specify the preferred host names to use for different submit hosts.
In the pbs_server configuration file torque.cfg you can change the way
the PBS server host presents itself IP-wise, but (as far as I understand)
not the way other submit hosts present themselves.
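
As a rough illustration of this second alternative, the Python sketch below
rewrites the host part of a "host:/path" value using a per-submit-host table
of preferred names. Everything in it is hypothetical: the table
PREFERRED_NAME, the function rewrite_host_path, and the idea that pbs_server
would consult such a table are assumptions made for illustration, not an
existing Torque setting.

```python
# Hypothetical table: external submit-host name -> name the compute nodes
# should use. No such Torque setting is implied to exist.
PREFERRED_NAME = {"tornado": "l1"}


def rewrite_host_path(value, preferred=PREFERRED_NAME):
    """Rewrite 'host:/path' so the host part uses the preferred internal name."""
    host, sep, path = value.partition(":")
    if not sep:
        return value  # no host part, leave the value untouched
    # Hosts without an entry in the table are kept as they are.
    return preferred.get(host, host) + sep + path


if __name__ == "__main__":
    print(rewrite_host_path("tornado:/home/lennart/STDIN.o4361"))
```

Applied to the Output_Path above, this would give l1:/home/lennart/STDIN.o4361,
which the compute nodes can reach over the internal network.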

Best regards,
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se
