[torqueusers] pbs_sched: badconn, "unauthorized host" on multihomed system

Steve Crusan scrusan at ur.rochester.edu
Wed Aug 4 13:12:20 MDT 2010


I believe you need to check the /var/spool/pbs/server_name file used by the
pbs_moms on your nodes.

Also, check that the pbs_server configuration files (torque.cfg) use the
correct hostname, and that the pbs_mom configuration files do as well.
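
If it helps, here is a rough sketch of how the server_name check could be
scripted so you can run it on every node. It only reads the first line of the
file and compares it with the name you expect. The function names are
hypothetical, the path is the one mentioned above, and "frontend" is assumed
to be the expected server name based on your message; adjust both for your
install.

```python
from pathlib import Path


def read_server_name(path):
    """Return the hostname on the first line of a server_name file.

    Returns None if the file is missing, unreadable, or empty.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    lines = text.strip().splitlines()
    return lines[0].strip() if lines else None


def server_name_matches(path, expected):
    """True only if the file exists and names the expected server."""
    return read_server_name(path) == expected


if __name__ == "__main__":
    # Assumed defaults: path from this thread, hostname from the quoted message.
    path = "/var/spool/pbs/server_name"
    expected = "frontend"
    found = read_server_name(path)
    status = "OK" if found == expected else "MISMATCH"
    print(f"{status}: {path} contains {found!r}, expected {expected!r}")
```

A MISMATCH on a node (for example, the file naming frontend-int) would be the
first thing I would fix before looking any further.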

Recently, we upgraded to a new version of torque, and it seemed that some of
the configuration files had been changed in the process. We ran into similar
problems afterward.

Good luck!


On 8/4/10 1:23 PM, "Lorin Hochstein" <lorin at isi.edu> wrote:

> The head node of my Ubuntu 10.04 cluster is multihomed: there's a public
> interface ("frontend") and a private interface ("frontend-int"). Recently
> there was a configuration change on the private interface, and I think that's
> preventing jobs from being submitted.
> 
> I have the torque server configured to run on the public interface, which I
> did by adding the following line to /etc/init.d/torque_server:
> DAEMON_SERVER_OPTS="-H frontend"
> 
> If I try to submit a job on frontend, it remains queued, even though there are
> nodes available. Each time I submit a job, I see a line like the following in
> /var/log/daemon.log:
> 
> Aug  4 13:16:53 island100 pbs_sched: badconn, frontend-int on port 814
> unauthorized host
> 
> If I look at the info about the submitted job, it says that the owner is on
> frontend-int, instead of frontend:
> 
> Job Id: 196.frontend
>     Job_Name = test.sh
>     Job_Owner = lorin at frontend-int
>     job_state = Q
>     queue = batch
>     server = frontend
>     Checkpoint = u
>     ctime = Wed Aug  4 13:16:53 2010
>     Error_Path = frontend:/home/lorin/test.sh.e196
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Wed Aug  4 13:16:53 2010
>     Output_Path = frontend:/home/lorin/test.sh.o196
>     Priority = 0
>     qtime = Wed Aug  4 13:16:53 2010
>     Rerunable = True
>     Resource_List.nodect = 1
>     Resource_List.nodes = 1
>     Resource_List.walltime = 01:00:00
>     Variable_List = PBS_O_HOME=/home/lorin,PBS_O_LANG=en_US.UTF-8,
>         PBS_O_LOGNAME=lorin,
>         PBS_O_PATH=/opt/TurboVNC/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin
>         :/usr/bin:/sbin:/bin:/usr/games:/home/lorin/bin:/opt/euca2ools/bin:/op
>         t/starccm+5.02.010/star/bin:/ansys_inc/v121/ansys/bin,
>         PBS_O_MAIL=/var/mail/lorin,PBS_O_SHELL=/bin/bash,
>         PBS_SERVER=frontend,PBS_O_HOST=frontend,
>         PBS_O_WORKDIR=/home/lorin,PBS_O_QUEUE=batch
>     etime = Wed Aug  4 13:16:53 2010
>     submit_args = test.sh
> 
> 
> It appears that it's associating the queued job with lorin at frontend-int instead
> of lorin at frontend. How does torque determine which host is associated with the
> job owner on a multihomed system, and how can I configure it so it associates
> jobs with frontend instead of frontend-int?
> 
> Thanks,
> 
> Lorin
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers



----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/


