[torqueusers] installation/configuration problem with multi-homed system. --- unauthorized host/request

Jason Bacon jwbacon at tds.net
Mon Nov 14 07:26:40 MST 2011


I had a similar issue and got around it by simply setting up /etc/hosts 
on each node properly.

On the multihomed head node, the hostname is bound to the external IP in 
/etc/hosts.  On the compute nodes, the hostname of the head node is 
bound to its internal address.  Also be sure that name resolution on 
the compute nodes is configured to check files before DNS.
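
A minimal sketch of that layout follows. The host name "head" and both 
addresses are hypothetical placeholders (192.0.2.10 standing in for the 
external address, 10.0.0.1 for the internal one), so substitute your own:

```
# /etc/hosts on the head node: its own name maps to the external IP
192.0.2.10    head

# /etc/hosts on each compute node: the head's name maps to its internal IP
10.0.0.1      head

# /etc/nsswitch.conf on the compute nodes: consult /etc/hosts before DNS
hosts:        files dns
```

Running "getent hosts head" on a compute node should then print the 
internal address if the lookup order is taking effect.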

No special configuration was required within torque.

Regards,

     -J

On 11/13/11 09:48, liu junjun wrote:
> Hi everyone,
>
> I am trying to install torque-3.0.2 on a multi-homed system (two NIC 
> networks) but am running into an authorization problem. Please read my 
> description of the problem below. Any help is highly appreciated!
>
> ---- System information ----
> OS: Ubuntu 10.10
> eth0: external_host_name
> eth1: internal_host_name
> hostname: internal_host_name
> --------------------------------------------
>
> ---- Basic Torque information ----
> Torque version: 3.0.2
> content of /var/spool/torque/server_name: internal_host_name
> content of /var/spool/torque/torque.cfg: SERVERHOST internal_host_name
>
> server and nodes can ping each other with internal_host_name
> ----------------------------------------
>
>
> ---- the problem -------------
> 1. My first try at the installation:
> Following the installation document at 
> http://www.adaptivecomputing.com/resources/docs/torque/1.1installation.php, 
> I had a problem with the "torque.setup" script. It gave me "unauthorized 
> request". I noticed that the problem may be related to my two NICs. 
> I then double-checked the server_name file and also added "SERVERHOST 
> internal_host_name" to torque.cfg. Unfortunately, the problem still 
> remains.
>
> 2. My second try at the installation:
> I removed the first installation, disabled eth0 (which is 
> associated with external_host_name), and recompiled torque with 
> exactly the same steps as in my first try. 
> Everything seemed fine: I could create a batch queue and submit jobs 
> that ran and terminated normally. However, once I enable eth0 
> (external_host_name), every qmgr command returns "unauthorized 
> request". I noticed that the server recognizes me as 
> user@external_host_name, whereas the pbs server is set to 
> internal_host_name, which is also the hostname. I guess this causes the 
> "unauthorized" issue, so I made the following settings, with eth0 
> disabled so that I was authorized to run them:
> ====
> qmgr -c 's s acl_hosts += external_host_name'
> qmgr -c 's s managers += root@external_host_name'
> qmgr -c 's s operators += root@external_host_name'
> qmgr -c 's s submit_hosts += external_host_name'
> ====
>
> After the above commands, I have operational access to 
> pbs_server even when eth0 is enabled. However, all submitted jobs 
> still remain in the Q state. The following are parts of the 'qstat 
> -f' output and the log files on the server:
> ==== part of 'qstat -f' command =====
> Job Id: 51.internal_host_name
>     Job_Name = STDIN
>     Job_Owner = user at exteral_host_name
>     job_state = Q
>     queue = batch
>     server = internal_host_name
>     Checkpoint = u
>     ctime = Sun Nov 13 19:25:12 2011
>     Error_Path = internal_host_name:/home/liu/STDIN.e51
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Sun Nov 13 19:25:12 2011
>     Output_Path = internal_host_name:/home/liu/STDIN.o51
> ===============================
>
> ==== part of pbs_server log ======
> 11/13/2011 19:25:05;0002;PBS_Server;Svr;PBS_Server;Torque Server 
> Version = 3.0.2, loglevel = 0
> 11/13/2011 19:25:12;0100;PBS_Server;Job;51.interal_host_name;enqueuing 
> into batch, state 1 hop 1
> 11/13/2011 19:25:12;0008;PBS_Server;Job;51.interal_host_name;Job 
> Queued at request of user at external_host_name, owner = 
> user at external_host_name, job name = STDIN, queue = batch
> 11/13/2011 19:25:12;0040;PBS_Server;Svr;cddlogin;Scheduler was sent 
> the command new
> 11/13/2011 19:25:12;0080;PBS_Server;Req;dis_request_read;req header 
> bad, dis error 7 (Premature end of message), type=Connect
> 11/13/2011 19:25:12;0080;PBS_Server;Req;req_reject;Reject reply 
> code=15058(Bad DIS based Request Protocol MSG=cannot decode message), 
> aux=0, type=Connect, from @
> 11/13/2011 19:25:12;0002;PBS_Server;Req;dis_reply_write;DIS reply 
> failure, -1
> =========================
>
> ==== part of pbs_sched log ======
> 11/13/2011 19:25:12;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::badconn, 
> external_host_name on port 762 unauthorized host
> ==========================
>
> As you can see from the above information, although external_host_name 
> is set as a submit_host, all jobs still remain in the 'Q' state 
> because the job owner is user@external_host_name! My question is:
> either 1. how can I make the server accept jobs from 
> user@external_host_name?
> or 2. how can I make the server recognize every submitted job as 
> belonging to user@internal_host_name?
>
> Thanks in advance!
>
> Junjun
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



