[torqueusers] my new pbs server is not working

shibo kuang s.b.kuang at gmail.com
Wed Mar 10 22:14:35 MST 2010


Dear Gus,
I just solved my problem.
I forgot to set the path on one node, which I added to my cluster this
morning.
Sorry for this silly mistake.
Thank you for your great help.
Cheers,
Shibo Kuang


On Thu, Mar 11, 2010 at 3:41 PM, shibo kuang <s.b.kuang at gmail.com> wrote:

> Dear Gus,
> Sorry for the trouble, and thank you for your help.
> See my replies below.
> On Thu, Mar 11, 2010 at 2:45 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>> Hi Shibo
>>
>> You still seem to have a problem with your PATH on the nodes.
>>
>
> Even when I use only one computer to set up Torque, I still have the same
> problem.
>
>
>
>>
>> Just to make sure I understood it right.
>>
>> 1) Did you apply the qmgr command to allow (re)submission
>> of jobs from the nodes that I sent you in the previous email?
>>
>
> Yes, I have already done it.
>
>
>>
>> 2) Did you change your .cshrc/.tcshrc or .profile/.bashrc file
>> to set your PATH to include /usr/local/bin
>> (if it is not already part of your PATH)?
>>
>>
> Yes, I have already done it using a user account.
>
>
>
>> You need to do this on all nodes (unless your home directory is
>> exported from the master node to the other nodes).
>> Or use the alternative I mentioned on /etc/profile.d/torque.[sh,csh],
>> also on all nodes.
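>>
>> For reference, a minimal sketch of those two files (assuming qsub is
>> installed in /usr/local/bin, as in your case) could be:
>>
>>    # /etc/profile.d/torque.sh -- for sh/bash logins
>>    export PATH=/usr/local/bin:${PATH}
>>
>>    # /etc/profile.d/torque.csh -- for csh/tcsh logins
>>    setenv PATH /usr/local/bin:${PATH}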
>>
>
> You mean put "export PATH=/usr/local/bin/:${PATH}" in
> /etc/profile.d/torque.sh?
> My system is bash based.
> I tried it, and it failed again.
>
>
> Cheers,
> Shibo
>
>
>
>
>>
>> 3) You can always use the full path name to qsub on your Torque/PBS
>> script, if you want to resubmit.
>> However, this is not really needed if your path is set correctly,
>> say, by one of the methods in item 2).
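>>
>> (With the full-path option, the resubmission line of your script,
>> currently "qsub case", would become:
>>
>>    /usr/local/bin/qsub case
>>
>> as you already verified.)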
>>
>>
>> I hope this helps,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>> shibo kuang wrote:
>>
>>> Hi Gus,
>>> I can now submit jobs on the nodes.
>>> However, I still get the same error as below when my pbs script tries
>>> to resubmit jobs:
>>>
>>> /var/spool/torque/mom_priv/jobs/127.master.SC: line 13: qsub: command
>>> not found
>>>
>>>
>>> It seems "qsub" cannot be recognized in the pbs_script. However, if I
>>> use /usr/local/bin/qsub, my script works successfully.
>>>
>>> So how can I let the pbs_script know the path of qsub?
>>>
>>>
>>> Cheers,
>>> Shibo Kuang
>>>
>>> On Thu, Mar 11, 2010 at 10:12 AM, Gus Correa
>>> <gus at ldeo.columbia.edu> wrote:
>>>
>>>    Hi Shibo
>>>
>>>    Sorry, I forgot this important step.
>>>    On your master node do this (you may need to do this as root,
>>>    or using "su" or "sudo", unless the user shibo is also a Torque
>>>    administrator):
>>>
>>>    qmgr -c "set server allow_node_submit = True"
>>>
>>>    to allow jobs to be submitted from all nodes,
>>>    not only from the master.
>>>
>>>    To confirm that the server configuration changed,
>>>    do:
>>>
>>>    qmgr -c "print server"
>>>
>>>
>>>    Also:
>>>
>>>    1) From what you say, it looks like your qsub is in
>>>    /usr/local/bin/qsub, not in /var/spool/torque/bin (my wrong guess).
>>>    2) There are no torque.sh and torque.csh files in /etc/profile.d.
>>>    You would need to *create* them.
>>>    However, this may not be necessary, as your Torque qsub command is
>>>    installed in /usr/local/bin, which is likely to be in your PATH
>>>    already.
>>>
>>>
>>>    I hope this helps.
>>>    Gus Correa
>>>    ---------------------------------------------------------------------
>>>    Gustavo Correa
>>>    Lamont-Doherty Earth Observatory - Columbia University
>>>    Palisades, NY, 10964-8000 - USA
>>>    ---------------------------------------------------------------------
>>>
>>>
>>>    shibo kuang wrote:
>>>
>>>        Dear Gus,
>>>        Thanks for your reply.
>>>        I am trying to move from Windows to Linux to do simulations,
>>>        and thus I am not familiar with Linux things.
>>>        Resubmission is not working on either the master or the node,
>>>        although the initial submission works on both.
>>>        When I run "which qsub" on the master and the node, both give
>>>        "/usr/local/bin/qsub".
>>>        Using export to set the path (export
>>>        PATH=/usr/local/bin:${PATH}) is not working. I cannot find
>>>        torque.sh, and thus cannot test the second method suggested.
>>>        There is no folder "/var/spool/torque/bin". Interestingly,
>>>        /var/spool/torque/pbs_environment gives "PATH=/bin:/usr/bin".
>>>        Thanks again; your further suggestions would be greatly
>>>        appreciated.
>>>        Cheers,
>>>        Shibo Kuang
>>>        On Thu, Mar 11, 2010 at 4:18 AM, Gus Correa
>>>        <gus at ldeo.columbia.edu> wrote:
>>>
>>>           Hi Shibo
>>>
>>>            Glad that your Torque/PBS is now working.
>>>
>>>           I would guess the problem you have now with job resubmission
>>>           is related to your PATH environment variable.
>>>           Somehow Linux cannot find qsub, and I suppose this happens
>>>           in the slave node.
>>>
>>>           Does it happen in the master node also?
>>>           What do you get if you login to the slave node and do
>>>           "which qsub", or just "qsub"?
>>>
>>>           Again, this is not a Torque problem, more of a Sys Admin issue.
>>>           A possible fix may depend a bit on where you installed Torque.
>>>           Assuming it is installed in /var/spool/torque/,
>>>           add /var/spool/torque/bin to your path
>>>           in your shell initialization script:
>>>
>>>           For csh/tcsh, in your .cshrc/.tcshrc
>>>
>>>           setenv PATH /var/spool/torque/bin:${PATH}
>>>
>>>           For sh/bash in .profile or maybe .bashrc
>>>
>>>           export PATH=/var/spool/torque/bin:${PATH}
>>>
>>>           An alternative is to add a torque.sh and a torque.csh file
>>>           to the /etc/profile.d directory *on every node* with the
>>>           contents above.
>>>           (This may depend a bit on which Linux distribution you use.
>>>           It works for Fedora, RedHat, and CentOS, and may work for
>>>           others too.)
>>>
>>>
>>>           I hope this helps.
>>>
>>>           Gus Correa
>>>
>>>  ---------------------------------------------------------------------
>>>           Gustavo Correa
>>>           Lamont-Doherty Earth Observatory - Columbia University
>>>           Palisades, NY, 10964-8000 - USA
>>>
>>>  ---------------------------------------------------------------------
>>>
>>>           shibo kuang wrote:
>>>
>>>               Hi All,
>>>               Now my pbs server can work with the help of Gus Correa. My
>>>               problem was due to the fact that I did not mount my master
>>>               folder on the nodes. Here, I got another problem with
>>>               automatically restarting a job.
>>>               Below is my script:
>>>               #!/bin/bash
>>>               #PBS -N inc90
>>>               #PBS -q short
>>>               #PBS -l walltime=00:08:00
>>>               cd $PBS_O_WORKDIR
>>>               ./nspff >out
>>>               if [ -f jobfinished ]; then
>>>                  rm -f jobfinished
>>>                  exit 0
>>>               fi
>>>               sleep 10
>>>               qsub case
>>>               My code stops at 7 min; it is supposed to be restarted
>>>               automatically after 10 s, but it failed with the
>>>               following error:
>>>
>>>               /var/spool/torque/mom_priv/jobs/120.master.SC: line 13:
>>>               qsub: command not found
>>>
>>>
>>>               Your help would be greatly appreciated.
>>>               Regards,
>>>               Shibo Kuang
>>>
>>>               On Wed, Mar 10, 2010 at 2:57 AM, Gus Correa
>>>               <gus at ldeo.columbia.edu> wrote:
>>>
>>>                      Hi Shibo
>>>
>>>                      Somehow your "slave" computer
>>>                      doesn't see /home/kuang/sharpbend/s1/r8,
>>>                      although it can be seen by the "master" computer.
>>>                      It may be one of several things,
>>>                      it is hard to tell exactly with the information
>>>                      you gave, but here are some guesses.
>>>
>>>                      Do you really have a separate
>>>                      /home/kuang/sharpbend/s1/r8 on your "slave"
>>>                      computer, or is it only present on the "master"?
>>>                      You can login to the "slave" and check this
>>>                      directly ("ls /home/kuang/sharpbend/s1/r8").
>>>                      If the directory is not there, this is not really
>>>                      a Torque or MPI problem, but a Sys Admin problem
>>>                      with exporting and mounting directories.
>>>
>>>                      If that directory exists only on the master side,
>>>                      you can either create an identical copy on the
>>>                      "slave" side (painful), or use NFS to export it
>>>                      from the "master" computer to the "slave" (easier).
>>>
>>>                      For the second approach, you need to export /home
>>>                      or /home/kuang on the "master" computer, and
>>>                      automount it on the "slave" computer.
>>>                      The files you need to edit are /etc/exports
>>>                      (master side), and /etc/auto.master plus perhaps
>>>                      /etc/auto.home (slave side).
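>>>
>>>                      As a rough sketch (assuming the slave's hostname
>>>                      is "slave"; adjust names to your setup), the
>>>                      relevant lines might look like this:
>>>
>>>                      # /etc/exports, on the master:
>>>                      /home/kuang  slave(rw,sync)
>>>
>>>                      # /etc/auto.master, on the slave:
>>>                      /home  /etc/auto.home
>>>
>>>                      # /etc/auto.home, on the slave:
>>>                      kuang  -fstype=nfs,rw  master:/home/kuang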
>>>
>>>                      A slightly different approach (not using the
>>>                      automounter) is just to hard mount /home or
>>>                      /home/kuang on the "slave" side by adding it to
>>>                      the /etc/fstab list.
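>>>
>>>                      In that case a single /etc/fstab line on the
>>>                      "slave" (same assumed hostnames) would do:
>>>
>>>                      master:/home/kuang  /home/kuang  nfs  defaults  0 0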
>>>
>>>                      You also need to turn on the NFS daemon on the
>>>                      "master" node with "chkconfig", if it is not yet
>>>                      turned on.
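>>>
>>>                      For example, on CentOS (as root):
>>>
>>>                      chkconfig nfs on     # enable NFS at boot
>>>                      service nfs start    # start it right away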
>>>
>>>                      Read the man pages!
>>>                      At least read "man exportfs", "man mountd",
>>>                      "man fstab", and "man chkconfig".
>>>
>>>                      You may need to reboot the computers for this to
>>>                      take effect.
>>>                      Then login to the "slave" and try again:
>>>                      "ls /home/kuang/sharpbend/s1/r8".
>>>
>>>                      I hope this helps.
>>>                      Gus Correa
>>>
>>>  ---------------------------------------------------------------------
>>>                      Gustavo Correa
>>>                      Lamont-Doherty Earth Observatory - Columbia
>>> University
>>>                      Palisades, NY, 10964-8000 - USA
>>>
>>>  ---------------------------------------------------------------------
>>>
>>>                      shibo kuang wrote:
>>>
>>>                          "/home/kuang/sharpbend/s1/r8: No such file or
>>>                          directory."
>>>                          My node does not have the directory, but my
>>>                          master has it.
>>>                          On Sun, Mar 7, 2010 at 1:09 AM, shibo kuang
>>>                          <s.b.kuang at gmail.com> wrote:
>>>
>>>                             Hi,
>>>                             I just fixed the problem by setting up
>>>                             password-free SSH between the computing
>>>                             node and the master.
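>>>
>>>                             (For reference, a typical way to set this
>>>                             up, assuming the user account "kuang", is
>>>                             roughly:
>>>
>>>                                ssh-keygen -t rsa   # on the node; empty passphrase
>>>                                ssh-copy-id kuang@master
>>>
>>>                             and likewise from the master to the node,
>>>                             so that scp between the two machines no
>>>                             longer asks for a password.)
>>>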
>>>                             But now I have got another problem:
>>>                             in r8.e19, it says
>>>                             "/home/kuang/sharpbend/s1/r8: No such file
>>>                             or directory."
>>>                             If only one computer is used, the server
>>>                             works normally.
>>>                             What did I miss when I installed Torque?
>>>                             Your help would be greatly appreciated.
>>>                             Cheers,
>>>                             Shibo Kuang
>>>
>>>
>>>                             On Sun, Mar 7, 2010 at 12:46 AM, shibo kuang
>>>                             <s.b.kuang at gmail.com> wrote:
>>>
>>>                                 Hi all,
>>>                                 I tried to install a pbs server for my
>>>                                 two CentOS Linux computers (each has 8
>>>                                 cores), but failed.
>>>                                 Here is my problem:
>>>                                 If I treat one computer as the master
>>>                                 running pbs_server, as well as a
>>>                                 computing node, I can submit jobs using
>>>                                 a script without any problem. All jobs
>>>                                 give the expected results.
>>>                                 However, when one computer is treated
>>>                                 as the master, and another as a
>>>                                 computing node, jobs are never
>>>                                 submitted successfully.
>>>                                 I would appreciate your hints and
>>>                                 suggestions regarding the following
>>>                                 messages I got.
>>>                                 Regards,
>>>                                 Shibo Kuang
>>>                                 Return-Path: <adm at master>
>>>                                 Received: from master (localhost [127.0.0.1])
>>>                                         by master (8.13.1/8.13.1) with ESMTP id o26DwKF9006310
>>>                                         for <kuang at master>; Sun, 7 Mar 2010 00:28:20 +1030
>>>                                 Received: (from root at localhost)
>>>                                         by master (8.13.1/8.13.1/Submit) id o26DwKpZ006293
>>>                                         for kuang at master; Sun, 7 Mar 2010 00:28:20 +1030
>>>                                 Date: Sun, 7 Mar 2010 00:28:20 +1030
>>>                                 From: adm <adm at master>
>>>                                 Message-Id: <201003061358.o26DwKpZ006293 at master>
>>>                                 To: kuang at master
>>>                                 Subject: PBS JOB 18.master
>>>                                 Precedence: bulk
>>>
>>>                                 PBS Job Id: 18.master
>>>                                 Job Name:   r8
>>>                                 Exec host:  par1/0
>>>                                 An error has occurred processing your job, see below.
>>>                                 Post job file processing error; job 18.master on host par1/0
>>>
>>>                                 Unable to copy file /var/spool/torque/spool/18.master.OU to
>>>                                 kuang at master:/home/kuang/sharpbend/s1/r8/r8.o18
>>>                                 *** error from copy
>>>                                 Permission denied (publickey,gssapi-with-mic,password).
>>>                                 lost connection
>>>                                 *** end error output
>>>                                 Output retained on that host in:
>>>                                 /var/spool/torque/undelivered/18.master.OU
>>>
>>>                                 Unable to copy file /var/spool/torque/spool/18.master.ER to
>>>                                 kuang at master:/home/kuang/sharpbend/s1/r8/r8.e18
>>>                                 *** error from copy
>>>                                 Permission denied (publickey,gssapi-with-mic,password).
>>>                                 lost connection
>>>                                 *** end error output
>>>                                 Output retained on that host in:
>>>                                 /var/spool/torque/undelivered/18.master.ER
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>
>