[torqueusers] wrong pbs server name

Samir Gartner jigzat at gmail.com
Thu May 21 20:58:26 MDT 2009


You guys RULE!! Thank you so much!! I reinstalled everything and
configured it according to your instructions, and it all went smoothly.

For the record, the Globus Toolkit tutorial is outdated and it is better to
follow the official TORQUE instructions:
http://www.clusterresources.com/torquedocs21/index.shtml

To compile under YellowDog 6.1 on a PlayStation 3, one must run this:

./configure --disable-gcc-warnings CC="gcc -m64"

Apparently it suffers from the same problem as Mac OS X, which does not
support -pedantic -Werror (whatever that is). This is just my guess, since
the first time I compiled it I got a bunch of recursive errors regarding
-pedantic -Werror.
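
For anyone repeating this on the same setup, the whole sequence could be
sketched roughly as follows. This is only a sketch under assumptions: the
2.3.6 version number and the mom/clients script names are guesses based on
paths and names mentioned elsewhere in this thread, and the run helper, the
build_torque function and the DRY_RUN variable are hypothetical names made
up for illustration. By default it only prints the steps.

```shell
#!/bin/sh
# Sketch of the build steps; prints them by default instead of running them.
# Assumptions: VERSION and ARCH are guesses taken from paths quoted in this
# thread; the mom/clients script names are inferred from the server one.
VERSION="${VERSION:-2.3.6}"
ARCH="${ARCH:-linux-powerpc64}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode (quoting is not preserved in the
# printed form), otherwise execute it in the current shell.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_torque() {
    run tar -zxf "torque-${VERSION}.tar.gz" || return 1
    run cd "torque-${VERSION}" || return 1
    # CC="gcc -m64" asks gcc for a 64-bit build; --disable-gcc-warnings is
    # the switch that got me past the -pedantic -Werror errors.
    run ./configure --disable-gcc-warnings CC="gcc -m64" || return 1
    run make || return 1
    run make packages || return 1
    run "./torque-package-mom-${ARCH}.sh" --install || return 1
    run "./torque-package-clients-${ARCH}.sh" --install || return 1
}

build_torque
```

Run as is, it just lists the planned commands. Setting DRY_RUN=0 would
execute them for real from the directory that holds the tarball.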

I still have some questions, but those are out of pure curiosity.

What is -pedantic -Werror?

Does it make a big difference not having support for it?

When I executed "make packages" I got some shell scripts to install the
packages. I used only "mom" and "clients", but I also got "devel", whose
purpose is clear, and "server". Even without explicitly executing the server
script, I got pbs_server installed on my node. So is
torque-package-server-linux-powerpc64.sh a different kind of server? If so,
what is its purpose?

2009/5/21 Samir Gartner <jigzat at gmail.com>

> PS: I got this warning after executing make packages, once for each of the
> packages
>
> libtool: install: warning: remember to run `libtool --finish /usr/local/lib'
>
> Should I execute it?
>
>
> 2009/5/21 Samir Gartner <jigzat at gmail.com>
>
>> Ok, I decided to reinstall everything and configure the system according
>> to everyone's instructions and suggestions. But I have a question. The
>> Globus Toolkit tutorial says to execute only the mom and clients shell
>> scripts, but not the server one.
>>
>> tar -zxf torque-2.0.0p7.tar.gz
>> cd torque-2.0.0p7
>> ./configure --prefix=/opt/pbs
>> make
>> make install
>> make packages
>> ./torque-package-clients-linux-i686.sh --install --destdir /opt/pbs
>> ./torque-package-mom-linux-i686.sh --install --destdir /opt/pbs
>>
>> Since rufian is the only node with TORQUE, shouldn't I also execute the
>> server script?
>>
>>
>> 2009/5/21 Gus Correa <gus at ldeo.columbia.edu>
>>
>>> Hi Samir
>>>
>>> As Jerry said, 127.0.0.1 is the IP address of the "loopback interface"
>>> (not a physical Ethernet port) on each computer.
>>> This is not to be confused with the IP address associated with the actual
>>> Ethernet port on the network you want to use for MPI communication.
>>>
>>>
>>> 1) Looking at your hosts file and your question ("Is this wrong?"),
>>> I would suggest:
>>>
>>> A) Uncomment this line:
>>>
>>> #127.0.0.1              localhost.localdomain localhost
>>>
>>> i.e., it should be:
>>>
>>> 127.0.0.1              localhost.localdomain localhost
>>>
>>> You need a loopback, but pointing to localhost.
>>>
>>> B) Change this line:
>>>
>>> 127.0.0.1 rufian.perrera.local rufian
>>>
>>> to something like this:
>>>
>>> 192.168.2.1 rufian.perrera.local rufian
>>>
>>> assuming the IP address 192.168.2.1 is not in use on your (private) net
>>> 192.168.2.0 (otherwise use another IP on the same net).
>>>
>>> C) Make sure this is consistent with whatever you have in
>>> /etc/sysconfig/network.
>>> (It seems to be OK, you only have the hostname, not the IP there.)
>>>
>>> D) Restart the network, or much easier, just reboot the computer.
>>>
>>> E) Make sure your other computers (auyin, pelusa, lamparita)
>>> have correct hosts files too. Each file should list all the computers
>>> on your 192.168.2.0 net in a consistent way and
>>> include the loopback interface as explained above.
>>>
>>> F) Each computer loops back to itself with the same special address
>>> 127.0.0.1, as Jerry explained.
>>> The IP 127.0.0.1 cannot be used as a regular IP.
>>> If you read carefully the top lines on the hosts file you will see
>>> the message: "Do not remove the following line".
>>> Well, commenting it out has the same effect, replacing it by another
>>> hostname is even worse, as you may have noticed.
>>>
>>>  >> # Do not remove the following line, or various programs
>>>  >> # that require network functionality will fail.
>>>  >> #127.0.0.1              localhost.localdomain localhost
>>>
>>>
>>> ***
>>>
>>> 2) As Jerry and I told you, copy the pbs_sched script from the contrib
>>> directory to /etc/init.d, use chkconfig --add pbs_sched
>>> so that it comes up when the machine boots, and start it manually this
>>> first time (/etc/init.d/pbs_sched start), or just reboot again.
>>>
>>>
>>> 3) Don't worry about not having /etc/sysconfig/pbs_server or
>>> /etc/sysconfig/pbs_sched.  They seem to be a legacy way to set up
>>> Torque/PBS.  I don't have them either, and it works.
>>>
>>> I hope this helps,
>>> Gus Correa
>>> ---------------------------------------------------------------------
>>> Gustavo Correa
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>>
>>>
>>>
>>> Jerry Smith wrote:
>>> > 127.0.0.1 is a special address that references localhost.
>>> > http://en.wikipedia.org/wiki/Localhost
>>> >
>>> >
>>> >
>>> > 127.0.0.1  is not what you want for your hostname ( pbs_moms trying to
>>> > connect to 127.0.0.1 will try to talk to themselves)
>>> >
>>> > You will want to set up an IP address on your pbs_server/scheduler node
>>> > that corresponds to the network that your pbs_moms are on.
>>> > And then make sure that the hostname you give it matches that of the
>>> > file in $PBS_HOME/server
>>> >
>>> > Copying the init script to /etc/init.d is a start, you will then
>>> > probably need to turn it on by running :
>>> >
>>> > To set it up to start on reboot:
>>> >
>>> > chkconfig --add pbs_sched
>>> > and then
>>> > chkconfig pbs_sched on
>>> >
>>> > To start it use /etc/init.d/pbs_sched start
>>> >
>>> >
>>> > --Jerry
>>> >
>>> >
>>> > Samir Gartner wrote:
>>> >> Ok Gus and everyone. Thanks again for your answers.
>>> >>
>>> >> There is no pbs_sched on /etc/init.d but it is here:
>>> >>
>>> >> /usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched
>>> >> /usr/local/src/torque-2.3.6/tpackages/server/opt/pbs/sbin/pbs_sched
>>> >> /usr/local/src/torque-2.3.6/src/scheduler.cc/.libs/pbs_sched
>>> >> /usr/local/src/torque-2.3.6/src/scheduler.cc/pbs_sched
>>> >> /opt/pbs/sbin/pbs_sched
>>> >>
>>> >> I was thinking of copying /opt/pbs/sbin/pbs_sched to /etc/init.d. Is it
>>> >> right to do that?
>>> >>
>>> >> Sorry about the "manually" word. It is local slang I guess. What I
>>> >> mean is that I went to the /opt/pbs/sbin/ folder and executed
>>> >> ./pbs_sched
>>> >>
>>> >> hostname output is:
>>> >>
>>> >> rufian.perrera.local
>>> >>
>>> >> hosts file contains:
>>> >>
>>> >> # Do not remove the following line, or various programs
>>> >> # that require network functionality will fail.
>>> >> #127.0.0.1              localhost.localdomain localhost
>>> >> <--------------------------Is this wrong?
>>> >> ::1             localhost6.localdomain6 localhost6
>>> >> 127.0.0.1 rufian.perrera.local rufian
>>> >> 192.168.2.6 auyin.perrera.local auyin
>>> >> 192.168.2.4 pelusa.perrera.local pelusa
>>> >> 192.168.2.2 lamparita.perrera.local lamparita
>>> >>
>>> >>
>>> >> network content is:
>>> >>
>>> >> NETWORKING=yes
>>> >> HOSTNAME=rufian.perrera.local
>>> >> DOMAINNAME=perrera.local
>>> >>
>>> >> I don't have /etc/sysconfig/pbs_server or /etc/sysconfig/pbs_sched
>>> >> either
>>> >>
>>> >>
>>> >> 2009/5/21 Gus Correa <gus at ldeo.columbia.edu>
>>> >>
>>> >>     Samir Gartner wrote:
>>> >>     > Ok, scheduling wasn't enabled,now it is,
>>> >>
>>> >>     It happens very often.
>>> >>     Fixing it is a good first step.
>>> >>
>>> >>     > but pbs_sched service was not
>>> >>     > found.
>>> >>
>>> >>     Starting up daemons in YDog may be different from RHEL, CentOS,
>>> >>     Fedora, so I am just guessing based on the latter. I am not
>>> >>     familiar with YDog. Anyway ...
>>> >>
>>> >>     Don't know if you got Torque from ClusterResources or elsewhere.
>>> >>     In any case, there should be a pbs_sched script in /etc/init.d.
>>> >>     If it is there, do "chkconfig --add pbs_sched" (or the YDog
>>> >>     equivalent), then do "chkconfig --list pbs_sched" to see which
>>> >>     runlevels it will be on, then "service pbs_sched start" to start
>>> >>     it, or if YDog doesn't have "service", run it with
>>> >>     "/etc/init.d/pbs_sched start".
>>> >>
>>> >>     If you don't have the pbs_sched script in /etc/init.d, you may
>>> >>     find one in the contrib subdirectory of the Torque source tree.
>>> >>     Copy it over to /etc/init.d, and do the above.
>>> >>     (The location may be other than /etc/init.d in YDog.)
>>> >>
>>> >>
>>> >>     > I didn't install maui, it is a default installation. As for the
>>> >>     > hosts file, it is properly configured, as are the nodes and mom
>>> >>     > config files.
>>> >>     >
>>> >>
>>> >>     You only need Maui if you want a complex scheduling policy.
>>> >>     pbs_sched is FIFO, very simple, but works fine.
>>> >>     I've used it for a long time without problems.
>>> >>
>>> >>     > when I manually start pbs_sched it says
>>> >>     >
>>> >>     > pbs_sched: addclient, host localhost not found
>>> >>     >
>>> >>
>>> >>     Hmm ... never got this one, not that I remember.
>>> >>     Not sure what you mean by "manually start pbs_sched".
>>> >>     Anyway, sounds like another, different, problem.
>>> >>
>>> >>
>>> >>     Is it possible that your "hostname" command
>>> >>     is not resolving your server name to rufian.perrera.local but to
>>> >>     localhost?
>>> >>     What is the output of "hostname"?
>>> >>     What do you have in /etc/hosts?
>>> >>     What do you have in /etc/sysconfig/network?
>>> >>
>>> >>     Just in case you have /etc/sysconfig/pbs_server and
>>> >>     /etc/sysconfig/pbs_sched, what are their contents?
>>> >>     (I don't have them.)
>>> >>
>>> >>     (Again just guessing, YDog may have different files to startup
>>> >>     things.)
>>> >>
>>> >>     I hope this helps,
>>> >>     Gus Correa
>>> >>     ---------------------------------------------------------------------
>>> >>     Gustavo Correa
>>> >>     Lamont-Doherty Earth Observatory - Columbia University
>>> >>     Palisades, NY, 10964-8000 - USA
>>> >>     ---------------------------------------------------------------------
>>> >>
>>> >>     >
>>> >>     > 2009/5/21 Samir Gartner <jigzat at gmail.com>
>>> >>     >
>>> >>     >     I think I'm gonna cry.... I love you guys!! No, seriously,
>>> >>     >     it worked, but only if executed under the root user. Now the
>>> >>     >     question is: what did I do wrong? Jobs should start
>>> >>     >     automatically, right?
>>> >>     >
>>> >>     >     I was following the Globus Toolkit tutorial first, but it is
>>> >>     >     kinda outdated, so I guess I issued some wrong instructions.
>>> >>     >
>>> >>     >     One of the weird things was that the tutorial suggested using
>>> >>     >     the /opt/pbs prefix when executing configure, and now I have
>>> >>     >     under /opt/pbs another /opt/pbs folder with repeated bin and
>>> >>     >     sbin folders and executables. Is this wrong, or is it how it
>>> >>     >     is supposed to be?
>>> >>     >
>>> >>     >     2009/5/21 Ling C. Ho <ling at fnal.gov>
>>> >>     >
>>> >>     >         Have you configured a scheduler?
>>> >>     >
>>> >>     >         What if you use qrun? Would any job start?
>>> >>     >
>>> >>     >         ...
>>> >>     >         ling
>>> >>     >
>>> >>     >         Samir Gartner wrote:
>>> >>     >
>>> >>     >             Ok, I don't see any file named default_server, but
>>> >>     >             server_name has the right server name,
>>> >>     >             rufian.perrera.local, and there is another file with
>>> >>     >             the same content named server_name.new.
>>> >>     >
>>> >>     >             Right now the PBS server name appears to be correct
>>> >>     >             (after stopping the server and manually deleting the
>>> >>     >             zombie jobs), but the jobs still won't start.
>>> >>     >
>>> >>     >
>>> >>     >             [samir at rufian ~]$ echo "sleep 30;date" | /opt/pbs/bin/qsub
>>> >>     >             [samir at rufian ~]$ /opt/pbs/bin/qstat -a
>>> >>     >
>>> >>     >             rufian.perrera.local:
>>> >>     >
>>> >>     >                                                                                      Req'd  Req'd   Elap
>>> >>     >             Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>>> >>     >             -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>>> >>     >             13.rufian.perrer     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>> >>     >             [samir at rufian ~]$
>>> >>     >
>>> >>     >
>>> >>     >             by the way, is top posting allowed??
>>> >>     >
>>> >>     >             2009/5/21 Jerry Smith <jdsmit at sandia.gov>
>>> >>     >
>>> >>     >
>>> >>     >                Samir,
>>> >>     >
>>> >>     >                What do you have in
>>> >>     >                $PBS_HOME/{server_name,default_server}?
>>> >>     >
>>> >>     >                It should be what resolves as the ethernet address
>>> >>     >                that pbs should be listening on.
>>> >>     >
>>> >>     >                --Jerry
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >                Samir Gartner wrote:
>>> >>     >
>>> >>     >                    Ok, I finally installed torque under
>>> >>     >                    yellowdog/ppc, but now I have another problem.
>>> >>     >                    I set up my pbs server as rufian.perrera.local,
>>> >>     >                    but when I issue a job it shows up under
>>> >>     >                    localhost.localdomain and stays in the queued
>>> >>     >                    state forever. And if I try to qdel the job, it
>>> >>     >                    can't reach the server and the connection times
>>> >>     >                    out. Any ideas of what could be wrong?
>>> >>     >                    I'm not trying to set up anything complicated,
>>> >>     >                    it is just one machine that works as server and
>>> >>     >                    client.
>>> >>     >
>>> >>     >                    this is the shell output
>>> >>     >
>>> >>     >                    [root at rufian bin]# /opt/pbs/bin/qstat -a
>>> >>     >
>>> >>     >                    rufian.perrera.local:
>>> >>     >
>>> >>     >                                                                                             Req'd  Req'd   Elap
>>> >>     >                    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>>> >>     >                    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>>> >>     >                    7.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>> >>     >                    8.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>> >>     >                    9.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>> >>     >                    10.localhost.loc     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>> >>     >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain
>>> >>     >                    Connection timed out
>>> >>     >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>>> >>     >                    Connection timed out
>>> >>     >                    You have new mail in /var/spool/mail/root
>>> >>     >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local
>>> >>     >                    qdel: Unknown Job Id 7.rufian.perrera.local
>>> >>     >                    [root at rufian bin]# su - samir
>>> >>     >                    [samir at rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain
>>> >>     >                    Connection timed out
>>> >>     >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>>> >>     >                    Connection timed out
>>> >>     >                    [samir at rufian ~]$
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>
>>> ------------------------------------------------------------------------
>>> >>     >
>>> >>     >             _______________________________________________
>>> >>     >             torqueusers mailing list
>>> >>     >             torqueusers at supercluster.org
>>> >>     >             http://www.supercluster.org/mailman/listinfo/torqueusers
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>     >
>>> >>
>>> >>
>>> >>
>>>
>>>
>>
>>
>