[torqueusers] my new pbs server is not working

Gus Correa gus at ldeo.columbia.edu
Wed Mar 10 16:12:08 MST 2010


Hi Shibo

Sorry, I forgot this important step.
On your master node do this (you may need to do this as root,
or using "su" or "sudo", unless the user shibo is also a Torque
administrator):

qmgr -c "set server allow_node_submit = True"

to allow jobs to be submitted from all nodes,
not only from the master.

To confirm that the server configuration changed,
do:

qmgr -c "print server"
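
To check the setting without reading the whole listing, something like
the sketch below may help. It assumes that "print server" writes the
attribute as a line of the form "set server allow_node_submit = True";
the function name node_submit_enabled is made up for this example.

```shell
# Hypothetical helper: reads "qmgr -c 'print server'" output on stdin and
# succeeds only if allow_node_submit is set to True (case-insensitive).
node_submit_enabled() {
    grep -Eiq '^[[:space:]]*set server allow_node_submit = true[[:space:]]*$'
}

# Only query the server when qmgr is actually installed on this machine.
if command -v qmgr >/dev/null 2>&1; then
    if qmgr -c "print server" | node_submit_enabled; then
        echo "allow_node_submit is enabled"
    else
        echo "allow_node_submit is NOT enabled" >&2
    fi
fi
```

If the second message appears, the "set server" command above probably
did not take effect (for example, because it was not run as a Torque
administrator).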


Also:

1) From what you say, it looks like your qsub is in /usr/local/bin/qsub,
not in /var/spool/torque/bin (my wrong guess).
2) There are no torque.sh and torque.csh files in /etc/profile.d.
You would need to *create* them.
However, this may not be necessary, as your Torque qsub command is
installed in /usr/local/bin, which is likely to be in your PATH already.
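
If you do decide to create /etc/profile.d/torque.sh, a minimal sketch
could look like the following. It assumes qsub lives in /usr/local/bin
(as "which qsub" reported); the helper name prepend_path is only for
illustration. It adds the directory to PATH only when it is not already
there.

```shell
# Hypothetical /etc/profile.d/torque.sh (sh/bash only).
# prepend_path is an illustrative helper name, not part of Torque.
prepend_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                 # directory already in PATH: do nothing
        *) PATH="$1:${PATH}" ;;      # otherwise put it at the front
    esac
}

# Assumption: qsub is installed in /usr/local/bin.
prepend_path /usr/local/bin
export PATH
```

A torque.csh would need the csh equivalent (setenv). Note that your
pbs_environment file shows PATH=/bin:/usr/bin, so if the job still cannot
find qsub, calling it by its full path (/usr/local/bin/qsub) in the job
script is another option.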

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


shibo kuang wrote:
> Dear Gus,
> Thanks for your reply.
> I am moving from Windows to Linux to run simulations, and thus
> am not familiar with Linux.
> Resubmission does not work on either the master or the node, although
> the first submission works on both.
> When I run "which qsub" on the master and the node, both return "/usr/local/bin/qsub".
> Using export to set the path (export PATH=/usr/local/bin:${PATH}) does
> not work. I cannot find torque.sh, so I cannot test the second
> method you suggested. There is no folder "/var/spool/torque/bin".
> Interestingly, /var/spool/torque/pbs_environment contains
> "PATH=/bin:/usr/bin".
> Thanks again; any further suggestions would be greatly appreciated.
> Cheers,
> Shibo kuang
>  
>  
> 
> On Thu, Mar 11, 2010 at 4:18 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> 
>     Hi Shibo
> 
>     Glad that your Torque/PBS is now working.
> 
>     I would guess the problem you have now with job resubmission
>     is related to your PATH environment variable.
>     Somehow Linux cannot find qsub, and I suppose this happens in the
>     slave node.
> 
>     Does it happen in the master node also?
>     What do you get if you login to the slave node and do "which qsub",
>     or just "qsub"?
> 
>     Again, this is not a Torque problem, more of a Sys Admin issue.
>     A possible fix may depend a bit on where you installed Torque.
>     Assuming it is installed in /var/spool/torque/,
>     add /var/spool/torque/bin to your path,
>     on your shell initialization script:
> 
>     For csh/tcsh, in your .cshrc/.tcshrc
> 
>     setenv PATH /var/spool/torque/bin:${PATH}
> 
>     For sh/bash in .profile or maybe .bashrc
> 
>     export PATH=/var/spool/torque/bin:${PATH}
> 
>     An alternative is to add a torque.sh and a torque.csh file
>     to the /etc/profile.d directory *on every node* with the
>     contents above.
>     (This may depend a bit on which Linux distribution you use.
>     It works for Fedora, RedHat, and CentOS, may work for others too.)
> 
> 
>     I hope this helps.
> 
>     Gus Correa
>     ---------------------------------------------------------------------
>     Gustavo Correa
>     Lamont-Doherty Earth Observatory - Columbia University
>     Palisades, NY, 10964-8000 - USA
>     ---------------------------------------------------------------------
> 
>     shibo kuang wrote:
> 
>         Hi All,
>         Now my pbs server works, with the help of Gus Correa. My
>         problem was that I did not mount my master folder on the
>         nodes. Now I have another problem with automatically restarting
>         a job.
>         Below is my script
>         #!/bin/bash
>         #PBS -N inc90
>         #PBS -q short
>         #PBS -l walltime=00:08:00
>         cd $PBS_O_WORKDIR
>         ./nspff >out
>         if [ -f jobfinished ]; then
>            rm -f jobfinished
>            exit 0
>         fi
>         sleep 10
>         qsub case
>         My code stops at 7 min; it is supposed to restart
>         automatically after 10 s, but failed with the following error:
>         /var/spool/torque/mom_priv/jobs/120.master.SC: line 13: qsub: command not found
> 
>          Your help would be greatly appreciated.
>          Regards,
>         Shibo Kuang
>          
>          
> 
>         On Wed, Mar 10, 2010 at 2:57 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> 
>                Hi Shibo
> 
>                Somehow your "slave" computer
>                doesn't see /home/kuang/sharpbend/s1/r8,
>                although it can be seen by the "master" computer.
>                It may be one of several things,
>                it is hard to tell exactly with the information you gave,
>                but here are some guesses.
> 
>                Do you really have a separate /home/kuang/sharpbend/s1/r8
>                on your "slave" computer, or is it only present in the "master"?
>                You can login to the "slave" and check this directly
>                ("ls /home/kuang/sharpbend/s1/r8").
>                If the directory is not there,
>                this is not really a Torque or MPI problem,
>                but a Sys Admin problem with exporting and mounting directories.
> 
>                If that directory exists only on the master side,
>                you can either create an identical copy on the "slave" side
>                (painful),
>                or use NFS to export it from the "master" computer to the
>                "slave" (easier).
> 
>                For the second approach, you need to export /home or /home/kuang
>                on the "master" computer, and automount it on the "slave" computer.
>                The files you need to edit are /etc/exports (master side),
>                and /etc/auto.master plus perhaps /etc/auto.home (slave side).
> 
>                A slightly different approach (not using the automounter)
>                is just to hard mount /home or /home/kuang
>                on the "slave" side by adding it to the /etc/fstab list.
> 
>                You also need to turn on the NFS daemon on the "master" node with
>                "chkconfig", if it is not yet turned on.
> 
>                Read the man pages!
>                At least read "man exportfs", "man mountd", "man fstab",
>                and "man chkconfig".
> 
>                You may need to reboot the computers for this to take effect.
>                Then login to the "slave" and try again
>                "ls /home/kuang/sharpbend/s1/r8".
> 
>                I hope this helps.
>                Gus Correa
>                ---------------------------------------------------------------------
>                Gustavo Correa
>                Lamont-Doherty Earth Observatory - Columbia University
>                Palisades, NY, 10964-8000 - USA
>                ---------------------------------------------------------------------
> 
>                shibo kuang wrote:
> 
>                    "/home/kuang/sharpbend/s1/r8: No such file or directory."
>                    My node does not have the directory, but my master has it.
> 
>                    On Sun, Mar 7, 2010 at 1:09 AM, shibo kuang <s.b.kuang at gmail.com> wrote:
> 
>                       Hi,
>                       I just fixed the problem by setting up password-free
>                       login between the computing node and the master.
>                       But now I have another problem:
>                       in r8.e19, it says
>                       /home/kuang/sharpbend/s1/r8: No such file or directory.
>                       If only one computer is used, the server works normally.
>                       What did I miss when I installed Torque?
>                       Your help would be greatly appreciated.
>                       Cheers,
>                       Shibo Kuang
> 
> 
>                       On Sun, Mar 7, 2010 at 12:46 AM, shibo kuang <s.b.kuang at gmail.com> wrote:
> 
>                           Hi all,
>                           I tried to install a pbs server for my two CentOS Linux
>                           computers (each has 8 cores), but failed.
>                           Here is my problem:
>                           If I use one computer both as the master running
>                           pbs_server and as a computing node, I can submit jobs
>                           using a script without any problem. All jobs give the
>                           exact results.
>                           However, when one computer is the master and the other
>                           is a computing node, jobs are never submitted successfully.
>                           I would appreciate your hints and suggestions
>                           regarding the following messages I got.
>                           Regards,
>                           Shibo Kuang
> 
>                           Return-Path: <adm at master>
>                           Received: from master (localhost [127.0.0.1])
>                                   by master (8.13.1/8.13.1) with ESMTP id o26DwKF9006310
>                                   for <kuang at master>; Sun, 7 Mar 2010 00:28:20 +1030
>                           Received: (from root at localhost)
>                                   by master (8.13.1/8.13.1/Submit) id o26DwKpZ006293
>                                   for kuang at master; Sun, 7 Mar 2010 00:28:20 +1030
>                           Date: Sun, 7 Mar 2010 00:28:20 +1030
>                           From: adm <adm at master>
>                           Message-Id: <201003061358.o26DwKpZ006293 at master>
>                           To: kuang at master
> 
> 
>                           Subject: PBS JOB 18.master
>                           Precedence: bulk
>                           PBS Job Id: 18.master
>                           Job Name:   r8
>                           Exec host:  par1/0
>                           An error has occurred processing your job, see below.
>                           Post job file processing error; job 18.master on host par1/0
>                           Unable to copy file /var/spool/torque/spool/18.master.OU to
>                           kuang at master:/home/kuang/sharpbend/s1/r8/r8.o18
>                           *** error from copy
>                           Permission denied (publickey,gssapi-with-mic,password).
>                           lost connection
>                           *** end error output
>                           Output retained on that host in:
>                           /var/spool/torque/undelivered/18.master.OU
>                           Unable to copy file /var/spool/torque/spool/18.master.ER to
>                           kuang at master:/home/kuang/sharpbend/s1/r8/r8.e18
>                           *** error from copy
>                           Permission denied (publickey,gssapi-with-mic,password).
>                           lost connection
>                           *** end error output
>                           Output retained on that host in:
>                           /var/spool/torque/undelivered/18.master.ER


