[torqueusers] pbs_sched: badconn, "unauthorized host" on multihomed system
Steve Crusan
scrusan at ur.rochester.edu
Wed Aug 4 13:12:20 MDT 2010
I believe you need to check the /var/spool/pbs/server_name file on each of
the nodes running pbs_mom.
Also, check that the pbs_server configuration files (torque.cfg) use the
correct hostname, and that the pbs_mom configuration files do as well.
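A minimal sketch of the first check, in case it helps. The helper name
check_server_name is made up for illustration (it is not part of TORQUE), and
it assumes the first line of the server_name file holds the server's hostname:

```shell
#!/bin/sh
# Hypothetical helper, not a TORQUE tool: compare the first line of a
# server_name file with the hostname the server is expected to use.
check_server_name() {
    file=$1
    expected=$2
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # Only the first line is compared (an assumption about the file layout).
    actual=$(head -n 1 "$file")
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file names $actual"
        return 0
    fi
    echo "MISMATCH: $file names '$actual', expected '$expected'" >&2
    return 1
}
```

On a node you could then run something like
"check_server_name /var/spool/pbs/server_name frontend" and compare the result
with what "getent hosts frontend" resolves to on that node.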
Recently, we upgraded to a new version of torque, and it seemed some of the
configuration files were changed. We were having similar problems.
Good luck!
On 8/4/10 1:23 PM, "Lorin Hochstein" <lorin at isi.edu> wrote:
> The head node of my Ubuntu 10.04 cluster is multihomed: there's a public
> interface ("frontend") and a private interface ("frontend-int"). Recently
> there was a configuration change on the private interface, and I think that's
> preventing jobs from being submitted.
>
> I have the torque server configured to run on the public interface, which I
> did by adding the following line to /etc/init.d/torque_server:
> DAEMON_SERVER_OPTS="-H frontend"
>
> If I try to submit a job on frontend, it remains queued, even though there are
> nodes available. Each time I submit a job, I see a line like the following
> in /var/log/daemon.log:
>
> Aug 4 13:16:53 island100 pbs_sched: badconn, frontend-int on port 814 unauthorized host
>
> If I look at the info about the submitted job, it says that the owner is on
> frontend-int, instead of frontend:
>
> Job Id: 196.frontend
> Job_Name = test.sh
> Job_Owner = lorin@frontend-int
> job_state = Q
> queue = batch
> server = frontend
> Checkpoint = u
> ctime = Wed Aug 4 13:16:53 2010
> Error_Path = frontend:/home/lorin/test.sh.e196
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = a
> mtime = Wed Aug 4 13:16:53 2010
> Output_Path = frontend:/home/lorin/test.sh.o196
> Priority = 0
> qtime = Wed Aug 4 13:16:53 2010
> Rerunable = True
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 01:00:00
> Variable_List = PBS_O_HOME=/home/lorin,PBS_O_LANG=en_US.UTF-8,
> PBS_O_LOGNAME=lorin,
> PBS_O_PATH=/opt/TurboVNC/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin
> :/usr/bin:/sbin:/bin:/usr/games:/home/lorin/bin:/opt/euca2ools/bin:/op
> t/starccm+5.02.010/star/bin:/ansys_inc/v121/ansys/bin,
> PBS_O_MAIL=/var/mail/lorin,PBS_O_SHELL=/bin/bash,
> PBS_SERVER=frontend,PBS_O_HOST=frontend,
> PBS_O_WORKDIR=/home/lorin,PBS_O_QUEUE=batch
> etime = Wed Aug 4 13:16:53 2010
> submit_args = test.sh
>
>
> It appears that it's associating the job with lorin@frontend-int instead
> of lorin@frontend. How does torque determine which host is associated with the
> job owner on a multihomed system, and how can I configure it so it associates
> jobs with frontend instead of frontend-int?
>
> Thanks,
>
> Lorin
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/