[torqueusers] wrong pbs server name
Jerry Smith
jdsmit at sandia.gov
Thu May 21 14:41:23 MDT 2009
127.0.0.1 is a special address that references localhost.
http://en.wikipedia.org/wiki/Localhost
127.0.0.1 is not what you want for your hostname (pbs_moms trying to
connect to 127.0.0.1 will try to talk to themselves).
You will want to set up an IP address on your pbs_server/scheduler node
that corresponds to the network that your pbs_moms are on,
and then make sure that the hostname you give it matches that of the
file in $PBS_HOME/server
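For illustration only, here is a minimal Python sketch (not part of Torque; the names loopback_hostnames, LOCAL_NAMES and SAMPLE are made up for this example) that scans hosts-file text and reports any non-localhost names mapped to a loopback address, which is the situation described above. The sample text is an abridged copy of the hosts file quoted below.

```python
import ipaddress

# Names that are expected to live on a loopback address (assumed list).
LOCAL_NAMES = {"localhost", "localhost.localdomain",
               "localhost6", "localhost6.localdomain6"}


def loopback_hostnames(hosts_text):
    """Return non-localhost names that hosts-file text maps to a loopback address."""
    flagged = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        if addr.is_loopback:
            flagged.extend(n for n in fields[1:] if n not in LOCAL_NAMES)
    return flagged


# Abridged copy of the hosts file from this thread.
SAMPLE = """\
#127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
127.0.0.1 rufian.perrera.local rufian
192.168.2.6 auyin.perrera.local auyin
"""

print(loopback_hostnames(SAMPLE))
```

With the hosts file from this thread it flags rufian.perrera.local and rufian. Pointing that entry at the machine's address on the cluster network instead should leave the list empty.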
Copying the init script to /etc/init.d is a start; you will then
probably need to turn it on.
To set it up to start on reboot, run:
chkconfig --add pbs_sched
and then
chkconfig pbs_sched on
To start it, use /etc/init.d/pbs_sched start
--Jerry
Samir Gartner wrote:
> Ok Gus and everyone. Thanks again for your answers.
>
> There is no pbs_sched on /etc/init.d but it is here:
>
> /usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched
> /usr/local/src/torque-2.3.6/tpackages/server/opt/pbs/sbin/pbs_sched
> /usr/local/src/torque-2.3.6/src/scheduler.cc/.libs/pbs_sched
> /usr/local/src/torque-2.3.6/src/scheduler.cc/pbs_sched
> /opt/pbs/sbin/pbs_sched
>
> I was thinking of copying /opt/pbs/sbin/pbs_sched to /etc/init.d. Is it
> right to do that?
>
> Sorry about the "manually" word. It is local slang I guess. What I
> mean is that I went to the /opt/pbs/sbin/ folder and executed ./pbs_sched
>
> hostname output is:
>
> rufian.perrera.local
>
> hosts file contains:
>
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> #127.0.0.1 localhost.localdomain localhost <-------------------------- Is this wrong?
> ::1 localhost6.localdomain6 localhost6
> 127.0.0.1 rufian.perrera.local rufian
> 192.168.2.6 auyin.perrera.local auyin
> 192.168.2.4 pelusa.perrera.local pelusa
> 192.168.2.2 lamparita.perrera.local lamparita
>
>
> network content is:
>
> NETWORKING=yes
> HOSTNAME=rufian.perrera.local
> DOMAINNAME=perrera.local
>
> I don't have /etc/sysconfig/pbs_server or /etc/sysconfig/pbs_sched either
>
>
> 2009/5/21 Gus Correa <gus at ldeo.columbia.edu>
>
> Samir Gartner wrote:
> > Ok, scheduling wasn't enabled,now it is,
>
> It happens very often.
> Fixing it is a good first step.
>
> > but pbs_sched service was not
> > found.
>
> Starting up daemons in YDog may be different from RHEL, CentOS, Fedora,
> so I am just guessing based on the latter. I am not familiar with YDog.
> Anyway ...
>
> Don't know if you got Torque from ClusterResources or elsewhere.
> In any case, there should be a pbs_sched script in /etc/init.d.
> If it is there, do "chkconfig --add pbs_sched" (or the YDog equivalent),
> then do "chkconfig --list pbs_sched" to see which runlevels it will be
> on, then "service pbs_sched start" to start it, or if YDog doesn't have
> "service", run it with "/etc/init.d/pbs_sched start".
>
> If you don't have the pbs_sched script in /etc/init.d, you may find one
> in the contrib subdirectory of the Torque source tree.
> Copy it over to /etc/init.d, and do the above.
> (The location may be other than /etc/init.d in YDog.)
>
>
> > I didn't install maui, it is a default installation. About hosts
> > file, it is properly configured as well as nodes and mom's config files.
> >
>
> You only need Maui if you want a complex scheduling policy.
> pbs_sched is FIFO, very simple, but works fine.
> I've used it for a long time without problems.
>
> > when I manually start pbs_sched it says
> >
> > pbs_sched: addclient, host localhost not found
> >
>
> Hmm ... never got this one, not that I remember.
> Not sure what you mean by "manually start pbs_sched".
> Anyway, it sounds like another, different, problem.
>
>
> Is it possible that your "hostname" command
> is not resolving your server name to rufian.perrera.local but to localhost?
> What is the output of "hostname"?
> What do you have in /etc/hosts?
> What do you have in /etc/sysconfig/network?
>
> Just in case you have /etc/sysconfig/pbs_server and
> /etc/sysconfig/pbs_sched, what are their contents?
> (I don't have them.)
>
> (Again just guessing, YDog may have different files to start up things.)
>
> I hope this helps,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> >
> > 2009/5/21 Samir Gartner <jigzat at gmail.com>
> >
> > I think I'm gonna cry.... I love you guys!! No, seriously, it worked,
> > but only if executed under the root user. Now the question is: what did I
> > do wrong? Jobs should start automatically, right?
> >
> > I was following the Globus Toolkit tutorial at first, but it is kinda
> > outdated, so I guess I issued some wrong instructions.
> >
> > One of the weird things was that the tutorial suggested using the
> > /opt/pbs prefix when executing configure, and now I have under
> > /opt/pbs again an /opt/pbs folder with repeated bin and sbin folders
> > and executables. Is this wrong, or is this how it is supposed to be?
> >
> > 2009/5/21 Ling C. Ho <ling at fnal.gov>
> >
> > Have you configured a scheduler?
> >
> > What if you use qrun? Would any job start?
> >
> > ...
> > ling
> >
> > Samir Gartner wrote:
> >
> > Ok, I don't see any file named default_server, but
> > server_name has the right server name, rufian.perrera.local,
> > and there is another file with the same content named
> > server_name.new.
> >
> > Right now the PBS server name appears to be correct (after
> > stopping the server and manually deleting the zombie jobs),
> > but the jobs still won't start.
> >
> >
> > [samir at rufian ~]$ echo "sleep 30;date" | /opt/pbs/bin/qsub
> > [samir at rufian ~]$ /opt/pbs/bin/qstat -a
> >
> > rufian.perrera.local:
> >
> >                                                                          Req'd  Req'd   Elap
> > Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> > 13.rufian.perrer     samir    batch    STDIN            --     1     --  --     01:00 Q --
> > [samir at rufian ~]$
> >
> >
> > By the way, is top posting allowed?
> >
> > 2009/5/21 Jerry Smith <jdsmit at sandia.gov>
> >
> >
> > Samir,
> >
> > What do you have in $PBS_HOME/{server_name,default_server}?
> >
> > It should be what resolves as the ethernet address that
> > pbs should be listening on.
> >
> > --Jerry
> >
> >
> >
> >
> > Samir Gartner wrote:
> >
> > Ok, I finally installed Torque under yellowdog/ppc, but now I have
> > another problem. I set up my pbs server as rufian.perrera.local,
> > but when I submit a job it shows up under localhost.localdomain
> > and stays in the queued state forever. And if I try to qdel the
> > job, it can't reach the server and the connection times out. Any
> > ideas of what could be wrong?
> > I'm not trying to set up anything complicated; it is just one
> > machine that works as server and client.
> >
> > this is the shell output
> >
> > [root at rufian bin]# /opt/pbs/bin/qstat -a
> >
> > rufian.perrera.local:
> >
> >                                                                          Req'd  Req'd   Elap
> > Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> > 7.localhost.loca     samir    batch    STDIN            --     1     --  --     01:00 Q --
> > 8.localhost.loca     samir    batch    STDIN            --     1     --  --     01:00 Q --
> > 9.localhost.loca     samir    batch    STDIN            --     1     --  --     01:00 Q --
> > 10.localhost.loc     samir    batch    STDIN            --     1     --  --     01:00 Q --
> > [root at rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain
> > Connection timed out
> > qdel: cannot connect to server localhost.localdomain (errno=110)
> > Connection timed out
> > You have new mail in /var/spool/mail/root
> > [root at rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local
> > qdel: Unknown Job Id 7.rufian.perrera.local
> > [root at rufian bin]# su - samir
> > [samir at rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain
> > Connection timed out
> > qdel: cannot connect to server localhost.localdomain (errno=110)
> > Connection timed out
> > [samir at rufian ~]$
> >
> >
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers