[torqueusers] wrong pbs server name

Jerry Smith jdsmit at sandia.gov
Thu May 21 14:41:23 MDT 2009


127.0.0.1 is a special address that references localhost.
http://en.wikipedia.org/wiki/Localhost



127.0.0.1 is not what you want for your hostname (pbs_moms trying to
connect to 127.0.0.1 will try to talk to themselves).

You will want to set up an IP address on your pbs_server/scheduler node
that corresponds to the network that your pbs_moms are on.
Then make sure that the hostname you give it matches the contents of
$PBS_HOME/server_name.
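
As a rough way to check this, the sketch below looks a name up in a
hosts-format file and tells you whether the address it finds is a
loopback one. The function names lookup_in_hosts and is_loopback are
made up for illustration (they are not part of Torque), and the lookup
only reads the file you pass in; it does not consult DNS.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not Torque commands.

# Print the first address that a hosts-format file maps NAME to.
#   $1 = host name or alias, $2 = hosts-format file
lookup_in_hosts() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # strip comments (whole-line or trailing)
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$2"
}

# Succeed (exit status 0) if ADDRESS is loopback: 127.x.x.x or ::1.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Typical use against the real file:
#   addr=$(lookup_in_hosts "$(hostname)" /etc/hosts)
#   if is_loopback "$addr"; then
#       echo "hostname resolves to loopback ($addr)"
#   fi
```

If the check reports loopback for the server's own name, the /etc/hosts
entry for that name should carry the node's address on the pbs_mom
network instead.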

Copying the init script to /etc/init.d is a start (use the script from
contrib/init.d in the source tree, not the /opt/pbs/sbin/pbs_sched
binary). You will then probably need to turn it on.

To set it up to start on reboot:

chkconfig --add pbs_sched
and then
chkconfig pbs_sched on

To start it use /etc/init.d/pbs_sched start
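
Put together, the steps could look something like the sketch below. It
assumes a chkconfig-based system; the run and setup_pbs_sched helpers
and the DRY_RUN variable are made up for illustration. By default it
only prints the commands, so you can review them first. The contrib
script may assume a different install prefix than /opt/pbs, so check
the paths inside it before enabling it.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are hypothetical.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 runs them (as root).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = path to the init script from the Torque source tree
setup_pbs_sched() {
    run cp "$1" /etc/init.d/pbs_sched        # install the init script
    run chmod 755 /etc/init.d/pbs_sched      # make sure it is executable
    run chkconfig --add pbs_sched            # register the service
    run chkconfig pbs_sched on               # enable it at boot
    run /etc/init.d/pbs_sched start          # start it now
}

# Example, using the contrib path from the listing quoted in this thread:
#   setup_pbs_sched /usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched
```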


--Jerry


Samir Gartner wrote:
> Ok Gus and everyone. Thanks again for your answers.
>
> There is no pbs_sched on /etc/init.d but it is here:
>
> /usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched
> /usr/local/src/torque-2.3.6/tpackages/server/opt/pbs/sbin/pbs_sched
> /usr/local/src/torque-2.3.6/src/scheduler.cc/.libs/pbs_sched
> /usr/local/src/torque-2.3.6/src/scheduler.cc/pbs_sched
> /opt/pbs/sbin/pbs_sched
>
> I was thinking of copying /opt/pbs/sbin/pbs_sched to /etc/init.d. Is it
> right to do that?
>
> Sorry about the "manually" word. It is local slang I guess. What I 
> mean is that I went to the /opt/pbs/sbin/ folder and executed ./pbs_sched
>
> hostname output is:
>
> rufian.perrera.local
>
> hosts file contains:
>
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> #127.0.0.1              localhost.localdomain localhost    <-- Is this wrong?
> ::1             localhost6.localdomain6 localhost6
> 127.0.0.1 rufian.perrera.local rufian
> 192.168.2.6 auyin.perrera.local auyin
> 192.168.2.4 pelusa.perrera.local pelusa
> 192.168.2.2 lamparita.perrera.local lamparita
>
>
> network content is:
>
> NETWORKING=yes
> HOSTNAME=rufian.perrera.local
> DOMAINNAME=perrera.local
>
> I don't have /etc/sysconfig/pbs_server or /etc/sysconfig/pbs_sched either
>
>
> 2009/5/21 Gus Correa <gus at ldeo.columbia.edu>
>
>     Samir Gartner wrote:
>     > Ok, scheduling wasn't enabled,now it is,
>
>     It happens very often.
>     Fixing it is a good first step.
>
>     > but pbs_sched service was not
>     > found.
>
>     Starting up daemons in YDog may be different from RHEL, CentOS, or
>     Fedora, so I am just guessing based on the latter. I'm not familiar
>     with YDog. Anyway ...
>
>     Don't know if you got Torque from ClusterResources or elsewhere.
>     In any case, there should be a pbs_sched script in /etc/init.d.
>     If it is there, do "chkconfig --add pbs_sched" (or YDog equivalent),
>     then do "chkconfig --list pbs_sched" to see which runlevels it will be
>     on, then "service pbs_sched start" to start it, or if YDog doesn't
>     have "service", run it with "/etc/init.d/pbs_sched start".
>
>     If you don't have the pbs_sched script in /etc/init.d, you may find
>     one in the contrib subdirectory of the Torque source tree.
>     Copy it over to /etc/init.d, and do the above.
>     (The location may be other than /etc/init.d in YDog.)
>
>
>     > I didn't install maui, it is a default installation. About hosts
>     > file, it is properly configured as well as nodes and mom's
>     > config files.
>     >
>
>     You only need Maui if you want a complex scheduling policy.
>     pbs_sched is FIFO, very simple, but works fine.
>     I've used it for a long time without problems.
>
>     > when I manually start pbs_sched it says
>     >
>     > pbs_sched: addclient, host localhost not found
>     >
>
>     Hmm ... never got this one, not that I remember.
>     Not sure what you mean by "manually start pbs_sched".
>     Anyway, it sounds like another, different, problem.
>
>
>     Is it possible that your "hostname" command
>     is not resolving your server name to rufian.perrera.local but to
>     localhost?
>     What is the output of "hostname"?
>     What do you have in /etc/hosts?
>     What do you have in /etc/sysconfig/network?
>
>     Just in case you have /etc/sysconfig/pbs_server and
>     /etc/sysconfig/pbs_sched, what are their contents?
>     (I don't have them.)
>
>     (Again just guessing, YDog may have different files to start
>     things up.)
>
>     I hope this helps,
>     Gus Correa
>     ---------------------------------------------------------------------
>     Gustavo Correa
>     Lamont-Doherty Earth Observatory - Columbia University
>     Palisades, NY, 10964-8000 - USA
>     ---------------------------------------------------------------------
>
>     >
>     > 2009/5/21 Samir Gartner <jigzat at gmail.com>
>     >
>     >     I think I'm gonna cry.... I love you guys!! No, seriously, it
>     >     worked, but only if executed under root user. Now the question
>     >     is what did I do wrong? Jobs should start automatically, right?
>     >
>     >     I was following first the Globus Toolkit tutorial but it is
>     >     kinda outdated, so I guess I issued some wrong instructions.
>     >
>     >     One of the weird things was that the tutorial suggested using
>     >     the /opt/pbs prefix when executing configure, and now I have
>     >     under /opt/pbs again a /opt/pbs folder with repeated bin and
>     >     sbin folders and executables. Is this wrong or is this how it
>     >     is supposed to be?
>     >
>     >     2009/5/21 Ling C. Ho <ling at fnal.gov>
>     >
>     >         Have you configured a scheduler?
>     >
>     >         What if you use qrun? Would any job start?
>     >
>     >         ...
>     >         ling
>     >
>     >         Samir Gartner wrote:
>     >
>     >             Ok, I don't see any file named default_server, but
>     >             server_name has the right server name,
>     >             rufian.perrera.local, and there is another file with
>     >             the same content named server_name.new.
>     >
>     >             Right now the PBS server name appears to be correct
>     >             (after stopping the server and manually deleting the
>     >             zombie jobs) but still the jobs won't start.
>     >
>     >
>     >             [samir at rufian ~]$ echo "sleep 30;date" | /opt/pbs/bin/qsub
>     >             [samir at rufian ~]$ /opt/pbs/bin/qstat -a
>     >
>     >             rufian.perrera.local:
>     >
>     >                                                                                      Req'd  Req'd   Elap
>     >             Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>     >             -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>     >             13.rufian.perrer     samir    batch    STDIN                --      1  --    --  01:00 Q   --
>     >             [samir at rufian ~]$
>     >
>     >
>     >             By the way, is top posting allowed?
>     >
>     >             2009/5/21 Jerry Smith <jdsmit at sandia.gov>
>     >
>     >
>     >                Samir,
>     >
>     >                What do you have in $PBS_HOME/{server_name,default_server}?
>     >
>     >                It should be what resolves as the ethernet address that
>     >                pbs should be listening on.
>     >
>     >                --Jerry
>     >
>     >
>     >
>     >
>     >                Samir Gartner wrote:
>     >
>     >                    Ok I finally installed torque under yellowdog/ppc but
>     >                    now I have another problem. I set up my pbs server as
>     >                    rufian.perrera.local but when I issue a job it shows
>     >                    itself in localhost.localdomain and it stays in the
>     >                    queued state forever. And if I try to qdel the job it
>     >                    can't reach the server and the connection times out.
>     >                    Any ideas of what could be wrong?
>     >                    I'm not trying to set up anything complicated, it is
>     >                    just one machine that works as server and client.
>     >
>     >                    this is the shell output
>     >
>     >                    [root at rufian bin]# /opt/pbs/bin/qstat -a
>     >
>     >                    rufian.perrera.local:
>     >
>     >                                                                                             Req'd  Req'd   Elap
>     >                    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>     >                    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>     >                    7.localhost.loca     samir    batch    STDIN                --      1  --    --  01:00 Q   --
>     >                    8.localhost.loca     samir    batch    STDIN                --      1  --    --  01:00 Q   --
>     >                    9.localhost.loca     samir    batch    STDIN                --      1  --    --  01:00 Q   --
>     >                    10.localhost.loc     samir    batch    STDIN                --      1  --    --  01:00 Q   --
>     >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain
>     >                    Connection timed out
>     >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>     >                    Connection timed out
>     >                    You have new mail in /var/spool/mail/root
>     >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local
>     >                    qdel: Unknown Job Id 7.rufian.perrera.local
>     >                    [root at rufian bin]# su - samir
>     >                    [samir at rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain
>     >                    Connection timed out
>     >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>     >                    Connection timed out
>     >                    [samir at rufian ~]$
>     >
>     >
>     >
>     >
>     >             ------------------------------------------------------------------------
>     >
>     >             _______________________________________________
>     >             torqueusers mailing list
>     >             torqueusers at supercluster.org
>     >             http://www.supercluster.org/mailman/listinfo/torqueusers