[torqueusers] my new pbs server is not working

shibo kuang s.b.kuang at gmail.com
Wed Mar 10 18:10:12 MST 2010


Hi Gus,
By default, I can submit jobs on the nodes.
Now I still get the same error as below when my PBS script tries to
resubmit jobs.
/var/spool/torque/mom_priv/jobs/127.master.SC: line 13: qsub: command not found

It seems "qsub" is not recognized inside the PBS script. However, if I use
/usr/local/bin/qsub, my script works successfully.

So how can I let the PBS script know the path to qsub?


Cheers,
Shibo Kuang

On Thu, Mar 11, 2010 at 10:12 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hi Shibo
>
> Sorry, I forgot this important step.
> On your master node do this (you may need to do this as root,
> or using "su" or "sudo", unless the user shibo is also a Torque
> administrator):
>
> qmgr -c "set server allow_node_submit = True"
>
> to allow jobs to be submitted from all nodes,
> not only from the master.
>
> To confirm that the server configuration changed,
> do:
>
> qmgr -c "print server"
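
A small check along these lines could confirm the setting without reading
the whole dump. It is only a sketch: it assumes the dump contains a line of
the form "set server allow_node_submit = True", and the function name
node_submit_enabled is made up for this example.

```shell
#!/bin/bash
# Sketch only: read the output of qmgr -c "print server" on stdin and
# succeed when allow_node_submit is set to True.
# Assumption: the setting is printed as "set server allow_node_submit = True".
node_submit_enabled() {
    grep -Eiq '^set server allow_node_submit = True[[:space:]]*$'
}
```

Usage: qmgr -c "print server" | node_submit_enabled && echo "node submission enabled"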
>
>
> Also:
>
> 1) From what you say, it looks like your qsub is in /usr/local/bin/qsub,
> not in /var/spool/torque/bin (my wrong guess).
> 2) There are no torque.sh and torque.csh files in /etc/profile.d.
> You would need to *create* them.
> However, this may not be necessary, as your Torque qsub command is
> installed in /usr/local/bin, which is likely to be in your PATH already.
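
If the torque.sh file does turn out to be needed, creating it could look
roughly like the sketch below. The function name write_torque_profile and
its two arguments are hypothetical; the defaults assume qsub is in
/usr/local/bin and that the file goes into /etc/profile.d (which requires
root). Whether a batch job picks the file up depends on how the job's shell
is started, so this is not a guaranteed fix.

```shell
#!/bin/bash
# Sketch only: write a torque.sh snippet that prepends the Torque bin
# directory to PATH for sh/bash login shells.
# Assumptions: bin directory /usr/local/bin, target directory /etc/profile.d.
write_torque_profile() {
    local bin_dir="${1:-/usr/local/bin}"
    local profile_dir="${2:-/etc/profile.d}"
    # Single quotes keep ${PATH} literal so it is expanded at login time.
    printf 'export PATH=%s:${PATH}\n' "$bin_dir" > "$profile_dir/torque.sh"
}
```

Run as root on every node, for example: write_torque_profile /usr/local/bin /etc/profile.d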
>
>
> I hope this helps.
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> shibo kuang wrote:
>
>> Dear Gus,
>> Thanks for your reply.
>> I am trying to move from Windows to Linux to do simulations, and thus am
>> not familiar with Linux things.
>> Resubmission does not work on either the master or the node, although
>> the first submission works on both.
>> When I run "which qsub" on the master and on the node, both give
>> "/usr/local/bin/qsub".
>> Using export to set the path (export PATH=/usr/local/bin:${PATH}) does
>> not work. I cannot find torque.sh, so I cannot test the second method
>> suggested. There is no folder "/var/spool/torque/bin". Interestingly,
>> /var/spool/torque/pbs_environment contains "PATH=/bin:/usr/bin".
>> Thanks again; your further suggestions would be greatly appreciated.
>> Cheers,
>> Shibo Kuang
>>
>> On Thu, Mar 11, 2010 at 4:18 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>
>>    Hi Shibo
>>
>>    Glad that your Torque/PBS is now working.
>>
>>    I would guess the problem you have now with job resubmission
>>    is related to your PATH environment variable.
>>    Somehow Linux cannot find qsub, and I suppose this happens in the
>>    slave node.
>>
>>    Does it happen in the master node also?
>>    What do you get if you login to the slave node and do "which qsub",
>>    or just "qsub"?
>>
>>    Again, this is not a Torque problem, more of a Sys Admin issue.
>>    A possible fix may depend a bit on where you installed Torque.
>>    Assuming it is installed in /var/spool/torque/,
>>    add /var/spool/torque/bin to your path,
>>    in your shell initialization script:
>>
>>    For csh/tcsh, in your .cshrc/.tcshrc
>>
>>    setenv PATH /var/spool/torque/bin:${PATH}
>>
>>    For sh/bash in .profile or maybe .bashrc
>>
>>    export PATH=/var/spool/torque/bin:${PATH}
>>
>>    An alternative is to add a torque.sh and a torque.csh file
>>    to the /etc/profile.d directory *on every node* with the
>>    contents above.
>>    (This may depend a bit on which Linux distribution you use.
>>    It works for Fedora, RedHat, and CentOS, may work for others too.)
>>
>>
>>    I hope this helps.
>>
>>    Gus Correa
>>
>>    shibo kuang wrote:
>>
>>        Hi All,
>>        Now my PBS server works, with the help of Gus Correa. My
>>        problem was due to the fact that I did not mount my master
>>        folder on the nodes. Here, I have another problem, with
>>        automatically restarting a job.
>>        Below is my script:
>>        #!/bin/bash
>>        #PBS -N inc90
>>        #PBS -q short
>>        #PBS -l walltime=00:08:00
>>        cd $PBS_O_WORKDIR
>>        ./nspff >out
>>        if [ -f jobfinished ]; then
>>           rm -f jobfinished
>>           exit 0
>>        fi
>>        sleep 10
>>        qsub case
>>        My code stops at 7 min. It is supposed to be restarted
>>        automatically after 10 s, but this failed with the following error:
>>        /var/spool/torque/mom_priv/jobs/120.master.SC: line 13: qsub: command not found
>>
>>
>>        Your help would be greatly appreciated.
>>        Regards,
>>        Shibo Kuang
>>
>>           On Wed, Mar 10, 2010 at 2:57 AM, Gus Correa
>>           <gus at ldeo.columbia.edu> wrote:
>>
>>               Hi Shibo
>>
>>               Somehow your "slave" computer
>>               doesn't see /home/kuang/sharpbend/s1/r8,
>>               although it can be seen by the "master" computer.
>>               It may be one of several things,
>>               it is hard to tell exactly with the information you gave,
>>               but here are some guesses.
>>
>>               Do you really have a separate /home/kuang/sharpbend/s1/r8
>>               on your "slave" computer, or is it only present on the
>>               "master"?
>>               You can log in to the "slave" and check this directly
>>               ("ls /home/kuang/sharpbend/s1/r8").
>>               If the directory is not there,
>>               this is not really a Torque or MPI problem,
>>               but a Sys Admin problem with exporting and mounting
>>               directories.
>>
>>               If that directory exists only on the master side,
>>               you can either create an identical copy on the "slave" side
>>               (painful),
>>               or use NFS to export it from the "master" computer to the
>>               "slave" (easier).
>>
>>               For the second approach, you need to export /home or
>>               /home/kuang on the "master" computer, and automount it
>>               on the "slave" computer.
>>               The files you need to edit are /etc/exports (master side),
>>               and /etc/auto.master plus perhaps /etc/auto.home (slave
>>               side).
>>
>>               A slightly different approach (not using the automounter)
>>               is to hard mount /home or /home/kuang
>>               on the "slave" side by adding it to the /etc/fstab list.
>>
>>               You also need to turn on the NFS daemon on the "master"
>>               node with "chkconfig", if it is not yet turned on.
>>
>>               Read the man pages!
>>               At least read "man exportfs", "man mountd", "man fstab",
>>               and "man chkconfig".
>>
>>               You may need to reboot the computers for this to take
>>               effect.
>>               Then log in to the "slave" and try
>>               "ls /home/kuang/sharpbend/s1/r8" again.
>>
>>               I hope this helps.
>>               Gus Correa
>>
>>
>>               shibo kuang wrote:
>>
>>                   "/home/kuang/sharpbend/s1/r8: No such file or directory."
>>                   My node does not have the directory, but my master has it.
>>
>>                   On Sun, Mar 7, 2010 at 1:09 AM, shibo kuang
>>                   <s.b.kuang at gmail.com> wrote:
>>
>>                      Hi,
>>                      I just fixed the problem by setting up password-free
>>                      access between the computing node and the master.
>>                      But now I have another problem:
>>                      r8.e19 says
>>                      /home/kuang/sharpbend/s1/r8: No such file or directory.
>>                      If only one computer is used, the server works normally.
>>                      What did I miss when I installed Torque?
>>                      Your help would be greatly appreciated.
>>                      Cheers,
>>                      Shibo Kuang
>>
>>
>>                           On Sun, Mar 7, 2010 at 12:46 AM, shibo kuang
>>                           <s.b.kuang at gmail.com> wrote:
>>
>>                          Hi all,
>>                          I tried to install a PBS server for my two CentOS
>>                          Linux computers (each has 8 cores), but failed.
>>                          Here is my problem:
>>                          If I treat one computer as the master running
>>                          pbs_server as well as a computing node, I can submit
>>                          jobs using a script without any problem. All jobs
>>                          give the correct results.
>>                          However, when one computer is treated as the master
>>                          and the other as a computing node, jobs are never
>>                          submitted successfully.
>>                          I would appreciate your hints and suggestions
>>                          based on the following messages I got.
>>                          Regards,
>>                          Shibo Kuang
>>                          Return-Path: <adm at master>
>>                          Received: from master (localhost [127.0.0.1])
>>                                  by master (8.13.1/8.13.1) with ESMTP id o26DwKF9006310
>>                                  for <kuang at master>; Sun, 7 Mar 2010 00:28:20 +1030
>>                          Received: (from root at localhost)
>>                                  by master (8.13.1/8.13.1/Submit) id o26DwKpZ006293
>>                                  for kuang at master; Sun, 7 Mar 2010 00:28:20 +1030
>>                          Date: Sun, 7 Mar 2010 00:28:20 +1030
>>                          From: adm <adm at master>
>>                          Message-Id: <201003061358.o26DwKpZ006293 at master>
>>                          To: kuang at master
>>                          Subject: PBS JOB 18.master
>>                          Precedence: bulk
>>
>>                          PBS Job Id: 18.master
>>                          Job Name:   r8
>>                          Exec host:  par1/0
>>                          An error has occurred processing your job, see below.
>>                          Post job file processing error; job 18.master on host par1/0
>>                          Unable to copy file /var/spool/torque/spool/18.master.OU to
>>                          kuang at master:/home/kuang/sharpbend/s1/r8/r8.o18
>>                          *** error from copy
>>                          Permission denied (publickey,gssapi-with-mic,password).
>>                          lost connection
>>                          *** end error output
>>                          Output retained on that host in:
>>                          /var/spool/torque/undelivered/18.master.OU
>>                          Unable to copy file /var/spool/torque/spool/18.master.ER to
>>                          kuang at master:/home/kuang/sharpbend/s1/r8/r8.e18
>>                          *** error from copy
>>                          Permission denied (publickey,gssapi-with-mic,password).
>>                          lost connection
>>                          *** end error output
>>                          Output retained on that host in:
>>                          /var/spool/torque/undelivered/18.master.ER
>>
>>                   _______________________________________________
>>                   torqueusers mailing list
>>                   torqueusers at supercluster.org
>>                   http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>