[torqueusers] wrong pbs server name

Samir Gartner jigzat at gmail.com
Thu May 21 14:40:45 MDT 2009


Sorry for being so insistent, but I have about 8 days to finish my graduation
project.

It seems that the jobs now try to start, but there is no output except an
email that says something like this:

From adm at rufian.perrera.local  Tue Apr 21 15:28:59 2009
Return-Path: <adm at rufian.perrera.local>
Received: from rufian.perrera.local (rufian.perrera.local [127.0.0.1])
        by rufian.perrera.local (8.13.8/8.13.8) with ESMTP id n3LKSwDi006753
        for <samir at rufian.perrera.local>; Tue, 21 Apr 2009 15:28:59 -0500
Received: (from root at localhost)
        by rufian.perrera.local (8.13.8/8.13.8/Submit) id n3LKSwro006752
        for samir at rufian.perrera.local; Tue, 21 Apr 2009 15:28:58 -0500
Date: Tue, 21 Apr 2009 15:28:58 -0500
From: adm <adm at rufian.perrera.local>
Message-Id: <200904212028.n3LKSwro006752 at rufian.perrera.local>
To: samir at rufian.perrera.local
Subject: PBS JOB 21.rufian.perrera.local

PBS Job Id: 21.rufian.perrera.local
Job Name:   STDIN
Exec host:  rufian.perrera.local/0
Aborted by PBS Server
Job does not exist on node

2009/5/21 Samir Gartner <jigzat at gmail.com>

> Ok Gus and everyone. Thanks again for your answers.
>
> There is no pbs_sched in /etc/init.d, but it is here:
>
> /usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched
> /usr/local/src/torque-2.3.6/tpackages/server/opt/pbs/sbin/pbs_sched
> /usr/local/src/torque-2.3.6/src/scheduler.cc/.libs/pbs_sched
> /usr/local/src/torque-2.3.6/src/scheduler.cc/pbs_sched
> /opt/pbs/sbin/pbs_sched
>
> I was thinking of copying /opt/pbs/sbin/pbs_sched to /etc/init.d. Is that
> the right thing to do?
>
> Sorry about the word "manually". It is local slang, I guess. What I mean is
> that I went to the /opt/pbs/sbin/ folder and executed ./pbs_sched
>
> hostname output is:
>
> rufian.perrera.local
>
> The hosts file contains:
>
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> #127.0.0.1              localhost.localdomain localhost   <-- Is this wrong?
> ::1             localhost6.localdomain6 localhost6
> 127.0.0.1 rufian.perrera.local rufian
> 192.168.2.6 auyin.perrera.local auyin
> 192.168.2.4 pelusa.perrera.local pelusa
> 192.168.2.2 lamparita.perrera.local lamparita
>
>
> /etc/sysconfig/network contains:
>
> NETWORKING=yes
> HOSTNAME=rufian.perrera.local
> DOMAINNAME=perrera.local
>
> I don't have /etc/sysconfig/pbs_server or /etc/sysconfig/pbs_sched either.
>
>
> 2009/5/21 Gus Correa <gus at ldeo.columbia.edu>
>
> Samir Gartner wrote:
>> > Ok, scheduling wasn't enabled, now it is,
>>
>> It happens very often.
>> Fixing it is a good first step.
>>
>> > but pbs_sched service was not
>> > found.
>>
>> Starting up daemons in YDog may be different from RHEL, CentOS, Fedora,
>> so I am just guessing based on the latter. I'm not familiar with YDog.
>> Anyway ...
>>
>> Don't know if you got Torque from ClusterResources or other.
>> In any case, there should be a pbs_sched script on /etc/init.d
>> If it is there, do "chkconfig --add pbs_sched" (or YDog equivalent),
>> then do "chkconfig --list pbs_sched" to see which runlevels it will be
>> on, then "service pbs_sched start" to start it, or if YDog doesn't have
>> "service", run it with "/etc/init.d/pbs_sched start".
>>
>> If you don't have the pbs_sched script in /etc/init.d, you may find one
>> in the contrib subdirectory of the Torque source tree.
>> Copy it over to /etc/init.d, and do the above.
>> (The location may be other than /etc/init.d in YDog.)
>>
>>
>> > I didn't install maui, it is a default installation. About hosts
>> > file, it is properly configured as well as nodes and mom's config files.
>> >
>>
>> You only need Maui if you want a complex scheduling policy.
>> pbs_sched is FIFO, very simple, but works fine.
>> I've used it for a long time without problems.
>>
>> > when I manually start pbs_sched it says
>> >
>> > pbs_sched: addclient, host localhost not found
>> >
>>
>> Hmm ... never got this one, not that I remember.
>> Not sure what you mean by "manually start pbs_sched".
>> Anyway, it sounds like another, different problem.
>>
>>
>> Is it possible that your "hostname" command
>> is not resolving your server name to rufian.perrera.local but to
>> localhost?
>> What is the output of "hostname"?
>> What do you have in /etc/hosts?
>> What do you have in /etc/sysconfig/network?
>>
>> Just in case you have /etc/sysconfig/pbs_server and
>> /etc/sysconfig/pbs_sched, what are their contents?
>> (I don't have them.)
>>
>> (Again just guessing, YDog may have different files to startup things.)
>>
>> I hope this helps,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>> >
>> > 2009/5/21 Samir Gartner <jigzat at gmail.com>
>> >
>> >     I think I'm gonna cry.... I love you guys!! No, seriously, it worked,
>> >     but only when executed as the root user. Now the question is: what
>> >     did I do wrong? Jobs should start automatically, right?
>> >
>> >     I was first following the Globus Toolkit tutorial, but it is kinda
>> >     outdated, so I guess I issued some wrong instructions.
>> >
>> >     One of the weird things was that the tutorial suggested using the
>> >     /opt/pbs prefix when running configure, and now under /opt/pbs I
>> >     have another /opt/pbs folder with repeated bin and sbin folders
>> >     and executables. Is this wrong, or is this how it is supposed to be?
>> >
>> >     2009/5/21 Ling C. Ho <ling at fnal.gov>
>> >
>> >         Have you configured a scheduler?
>> >
>> >         What if you use qrun? Would any job start?
>> >
>> >         ...
>> >         ling
>> >
>> >         Samir Gartner wrote:
>> >
>> >             OK, I don't see any file named default_server, but
>> >             server_name has the right server name, rufian.perrera.local,
>> >             and there is another file with the same content named
>> >             server_name.new.
>> >
>> >             Right now the PBS server name appears to be correct (after
>> >             stopping the server and manually deleting the zombie jobs),
>> >             but the jobs still won't start.
>> >
>> >
>> >             [samir at rufian ~]$ echo "sleep 30;date" | /opt/pbs/bin/qsub
>> >             [samir at rufian ~]$ /opt/pbs/bin/qstat -a
>> >
>> >             rufian.perrera.local:
>> >
>> >                                                                                      Req'd  Req'd   Elap
>> >             Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> >             -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> >             13.rufian.perrer     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> >             [samir at rufian ~]$
>> >
>> >
>> >             By the way, is top posting allowed?
>> >
>> >             2009/5/21 Jerry Smith <jdsmit at sandia.gov>
>> >
>> >
>> >                Samir,
>> >
>> >                What do you have in $PBS_HOME/{server_name,default_server}?
>> >
>> >                It should be what resolves as the ethernet address that
>> >                pbs should be listening on.
>> >
>> >                --Jerry
>> >
>> >
>> >
>> >
>> >                Samir Gartner wrote:
>> >
>> >                    OK, I finally installed torque under yellowdog/ppc, but
>> >                    now I have another problem. I set up my pbs server as
>> >                    rufian.perrera.local, but when I submit a job it shows
>> >                    up under localhost.localdomain and stays in the queued
>> >                    state forever. If I try to qdel the job, it can't reach
>> >                    the server and the connection times out. Any ideas what
>> >                    could be wrong?
>> >                    I'm not trying to set up anything complicated; it is
>> >                    just one machine that works as both server and client.
>> >
>> >                    this is the shell output
>> >
>> >                    [root at rufian bin]# /opt/pbs/bin/qstat -a
>> >
>> >                    rufian.perrera.local:
>> >
>> >                                                                                             Req'd  Req'd   Elap
>> >                    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> >                    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> >                    7.localhost.loca     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> >                    8.localhost.loca     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> >                    9.localhost.loca     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> >                    10.localhost.loc     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain
>> >                    Connection timed out
>> >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>> >                    Connection timed out
>> >                    You have new mail in /var/spool/mail/root
>> >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local
>> >                    qdel: Unknown Job Id 7.rufian.perrera.local
>> >                    [root at rufian bin]# su - samir
>> >                    [samir at rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain
>> >                    Connection timed out
>> >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>> >                    Connection timed out
>> >                    [samir at rufian ~]$
>> >
>> >
>> >
>> >
>> >
>> ------------------------------------------------------------------------
>> >
>> >             _______________________________________________
>> >             torqueusers mailing list
>> >             torqueusers at supercluster.org
>> >             http://www.supercluster.org/mailman/listinfo/torqueusers
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>
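
A note on the hosts file quoted above: with the "127.0.0.1 localhost" line
commented out, the name "localhost" has no IPv4 entry at all, and this may be
related to the "pbs_sched: addclient, host localhost not found" message. That
link is an assumption, not something confirmed in this thread. Below is a
minimal shell sketch for checking which address a hosts-style file maps a name
to. The function name hosts_lookup and the temporary sample file are made up
for illustration; they are not part of Torque.

```shell
#!/bin/sh
# hosts_lookup FILE NAME
# Print the address that a hosts-style FILE maps NAME to.
# Commented-out entries are ignored. Returns non-zero if NAME is absent.
# (hosts_lookup is a hypothetical helper, not part of Torque.)
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }            # drop comments, including commented-out entries
        NF >= 2 {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit !found }           # status 0 only if a match was printed
    ' "$1"
}

# Sample data copied from the hosts file quoted in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#127.0.0.1              localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
127.0.0.1 rufian.perrera.local rufian
192.168.2.6 auyin.perrera.local auyin
192.168.2.4 pelusa.perrera.local pelusa
192.168.2.2 lamparita.perrera.local lamparita
EOF

hosts_lookup "$sample" rufian.perrera.local     # prints 127.0.0.1
hosts_lookup "$sample" localhost || echo "localhost: no entry"
rm -f "$sample"
```

On the real machine the same check would be "hosts_lookup /etc/hosts localhost".
Note that this only reads the file; it does not consult DNS or any other
resolver source.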