[torqueusers] pbs_sched: badconn, "unauthorized host" on multihomed system

Lorin Hochstein lorin at isi.edu
Wed Aug 4 11:23:03 MDT 2010


The head node of my Ubuntu 10.04 cluster is multihomed: there's a public interface ("frontend") and a private interface ("frontend-int"). Recently there was a configuration change on the private interface, and I think that change is preventing submitted jobs from running.

I have the torque server configured to run on the public interface, which I did by adding the following line to /etc/init.d/torque_server: DAEMON_SERVER_OPTS="-H frontend"

If I submit a job on frontend, it remains queued, even though there are nodes available. Each time I submit a job, a line like the following appears in /var/log/daemon.log:

Aug  4 13:16:53 island100 pbs_sched: badconn, frontend-int on port 814 unauthorized host

If I look at the info about the submitted job, it says that the owner is on frontend-int, instead of frontend:

Job Id: 196.frontend
    Job_Name = test.sh
    Job_Owner = lorin at frontend-int
    job_state = Q
    queue = batch
    server = frontend
    Checkpoint = u
    ctime = Wed Aug  4 13:16:53 2010
    Error_Path = frontend:/home/lorin/test.sh.e196
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Aug  4 13:16:53 2010
    Output_Path = frontend:/home/lorin/test.sh.o196
    Priority = 0
    qtime = Wed Aug  4 13:16:53 2010
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    Variable_List = PBS_O_HOME=/home/lorin,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=lorin,
        PBS_O_PATH=/opt/TurboVNC/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin
        :/usr/bin:/sbin:/bin:/usr/games:/home/lorin/bin:/opt/euca2ools/bin:/op
        t/starccm+5.02.010/star/bin:/ansys_inc/v121/ansys/bin,
        PBS_O_MAIL=/var/mail/lorin,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=frontend,PBS_O_HOST=frontend,
        PBS_O_WORKDIR=/home/lorin,PBS_O_QUEUE=batch
    etime = Wed Aug  4 13:16:53 2010
    submit_args = test.sh


It appears that Torque is associating the job owner with lorin at frontend-int instead of lorin at frontend. How does torque determine which host is associated with the job owner on a multihomed system, and how can I configure it so that it associates jobs with frontend instead of frontend-int?
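
For anyone trying to reproduce this, a quick check of how the two names resolve on the head node might help narrow things down. The following is only a diagnostic sketch using Python's standard socket module. It does not assume anything about how Torque itself picks the hostname, and describe_host is a hypothetical helper name, not part of Torque.

```python
import socket


def describe_host(name):
    """Return (name, address, reverse_name) for a hostname.

    address is the IPv4 address the name resolves to, and reverse_name is
    the primary name returned by a reverse lookup of that address. Either
    is None when the corresponding lookup fails.
    """
    try:
        address = socket.gethostbyname(name)
    except OSError:
        # Forward lookup failed (socket.gaierror is a subclass of OSError).
        return (name, None, None)
    try:
        reverse_name = socket.gethostbyaddr(address)[0]
    except OSError:
        # Reverse lookup failed (socket.herror is a subclass of OSError).
        reverse_name = None
    return (name, address, reverse_name)


# Example usage on the head node (hostnames taken from this post):
#   for host in ("frontend", "frontend-int"):
#       print(describe_host(host))
```

If the reverse lookup of the public address comes back as frontend-int, or both names map to the same address, that would at least be consistent with the Job_Owner and badconn output above, although it would not by itself show where Torque gets the name.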

Thanks,

Lorin
