[torqueusers] my new pbs server is not working

Gus Correa gus at ldeo.columbia.edu
Wed Mar 10 10:18:32 MST 2010


Hi Shibo

Glad that your Torque/PBS is now working.

I would guess the problem you have now with job resubmission
is related to your PATH environment variable.
Somehow Linux cannot find qsub, and I suppose this happens on the
slave node.

Does it happen on the master node also?
What do you get if you log in to the slave node and run "which qsub",
or just "qsub"?

Again, this is not really a Torque problem, but more of a Sys Admin issue.
A possible fix may depend a bit on where you installed Torque.
Assuming it is installed in /var/spool/torque/,
add /var/spool/torque/bin to your PATH
in your shell initialization script:

For csh/tcsh, in your .cshrc/.tcshrc:

setenv PATH /var/spool/torque/bin:${PATH}

For sh/bash, in your .profile or perhaps .bashrc:

export PATH=/var/spool/torque/bin:${PATH}

An alternative is to add a torque.sh and a torque.csh file
to the /etc/profile.d directory *on every node* with the
contents above (see the sketch below).
(This may depend a bit on which Linux distribution you use.
It works for Fedora, RedHat, and CentOS, and may work for others too.)
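
For example, the two files could look like this (a minimal sketch,
assuming the Torque binaries really live in /var/spool/torque/bin;
adjust the path to your installation):

# /etc/profile.d/torque.sh  -- picked up by sh/bash login shells
export PATH=/var/spool/torque/bin:${PATH}

# /etc/profile.d/torque.csh -- picked up by csh/tcsh login shells
setenv PATH /var/spool/torque/bin:${PATH}

Make both files readable by everyone (e.g. chmod 644) so that every
user's login shell sources them on every node.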

I hope this helps.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

shibo kuang wrote:
> Hi All,
> Now my pbs server works, with the help of Gus Correa. My problem was
> due to the fact that I did not mount my master folder on the nodes. Here I
> got another problem, with automatically restarting a job.
> Below is my script:
>  
> #!/bin/bash
> #PBS -N inc90
> #PBS -q short
> #PBS -l walltime=00:08:00
> cd $PBS_O_WORKDIR
> ./nspff >out
> if [ -f jobfinished ]; then
>     rm -f jobfinished
>     exit 0
> fi
> sleep 10
> qsub case
>  
> My code stops at 7 min; it is supposed to restart automatically after
> 10 s, but it fails with the following error:
>  
> /var/spool/torque/mom_priv/jobs/120.master.SC:
> line 13: qsub: command not found
>  
> Your help would be greatly appreciated.
>  
> Regards,
> Shibo Kuang
>  
>  
> 
>  
> 
> 
>     On Wed, Mar 10, 2010 at 2:57 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> 
>         Hi Shibo
> 
>         Somehow your "slave" computer
>         doesn't see /home/kuang/sharpbend/s1/r8,
>         although it can be seen by the "master" computer.
>         It may be one of several things,
>         it is hard to tell exactly with the information you gave,
>         but here are some guesses.
> 
>         Do you really have a separate /home/kuang/sharpbend/s1/r8
>         on your "slave" computer, or is it only present on the "master"?
>         You can log in to the "slave" and check this directly
>         ("ls /home/kuang/sharpbend/s1/r8").
>         If the directory is not there,
>         this is not really a Torque or MPI problem,
>         but a Sys Admin problem with exporting and mounting directories.
> 
>         If that directory exists only on the master side,
>         you can either create an identical copy on the "slave" side
>         (painful),
>         or use NFS to export it from the "master" computer to the
>         "slave" (easier).
> 
>         For the second approach, you need to export the /home or /home/kuang
>         on the "master" computer, and automount it on the "slave" computer.
>         The files you need to edit are /etc/exports (master side),
>         and /etc/auto.master plus perhaps /etc/auto.home (slave side).
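> 
>         For example (a rough sketch only; I am assuming the master's
>         hostname is "master", the slave's hostname is "slave", and that
>         you export all of /home; adjust names and options to your setup):
> 
>         # /etc/exports on the master: export /home to the slave
>         /home    slave(rw,sync,no_root_squash)
> 
>         # /etc/auto.master on the slave: let the automounter manage /home
>         /home    /etc/auto.home
> 
>         # /etc/auto.home on the slave: map each home directory to the master
>         *    -fstype=nfs,rw    master:/home/&
> 
>         After editing /etc/exports, run "exportfs -ra" on the master so
>         the new export takes effect.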
> 
>         A somewhat different approach (not using the automounter)
>         is just to hard-mount /home or /home/kuang
>         on the "slave" side by adding it to the /etc/fstab list.
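> 
>         For instance, a line like this in the slave's /etc/fstab would
>         do it (again a sketch; "master" is assumed to be the master's
>         hostname, and the mount options are only illustrative):
> 
>         master:/home    /home    nfs    rw,hard,intr    0 0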
> 
>         You also need to turn on the NFS daemon on the "master" node with
>         "chkconfig", if it is not yet turned on.
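> 
>         Something along these lines, run as root (service names as on
>         RedHat-style systems such as CentOS):
> 
>         # on the master
>         chkconfig nfs on
>         service nfs start
> 
>         # on the slave, only if you use the automounter
>         chkconfig autofs on
>         service autofs restart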
> 
>         Read the man pages!
>         At least read "man exportfs", "man mountd", "man fstab",
>         and "man chkconfig".
> 
>         You may need to reboot the computers for this to take effect.
>         Then log in to the "slave" and try again:
>         "ls /home/kuang/sharpbend/s1/r8".
> 
>         I hope this helps.
>         Gus Correa
>         ---------------------------------------------------------------------
>         Gustavo Correa
>         Lamont-Doherty Earth Observatory - Columbia University
>         Palisades, NY, 10964-8000 - USA
>         ---------------------------------------------------------------------
> 
>         shibo kuang wrote:
> 
>             "/home/kuang/sharpbend/s1/r8: No such file or directory."
>             My node does not have the directory, but my master has it.
>              
> 
>             On Sun, Mar 7, 2010 at 1:09 AM, shibo kuang
>             <s.b.kuang at gmail.com> wrote:
> 
>                Hi,
>                I just fixed the problem by setting up password-free login
>                between the computing node and the master.
>                But now I got another problem:
>                In r8.e19, it says:
>                /home/kuang/sharpbend/s1/r8: No such file or directory.
>                If only one computer is used, the server works normally.
>                What did I miss when installing Torque?
>                Your help would be greatly appreciated.
>                Cheers,
>                Shibo Kuang
> 
> 
>                     On Sun, Mar 7, 2010 at 12:46 AM, shibo kuang
>                     <s.b.kuang at gmail.com> wrote:
> 
>                    Hi all,
>                    I tried to install a pbs server for my two CentOS Linux
>                    computers (each has 8 cores), but failed.
>                    Here is my problem:
>                    If I treat one computer as the master (running pbs_server)
>                    as well as a computing node, I can submit jobs using a
>                    script without any problem, and all jobs give the correct
>                    results.
>                    However, when one computer is treated as the master and
>                    another as a computing node, jobs are never submitted
>                    successfully.
>                    I would appreciate your hints and suggestions regarding
>                    the following messages I got.
>                    Regards,
>                    Shibo Kuang
> 
>                    Return-Path: <adm at master>
>                    Received: from master (localhost [127.0.0.1])
>                            by master (8.13.1/8.13.1) with ESMTP id o26DwKF9006310
>                            for <kuang at master>; Sun, 7 Mar 2010 00:28:20 +1030
>                    Received: (from root at localhost)
>                            by master (8.13.1/8.13.1/Submit) id o26DwKpZ006293
>                            for kuang at master; Sun, 7 Mar 2010 00:28:20 +1030
>                    Date: Sun, 7 Mar 2010 00:28:20 +1030
>                    From: adm <adm at master>
>                    Message-Id: <201003061358.o26DwKpZ006293 at master>
>                    To: kuang at master
>                    Subject: PBS JOB 18.master
>                    Precedence: bulk
> 
>                    PBS Job Id: 18.master
>                    Job Name:   r8
>                    Exec host:  par1/0
>                    An error has occurred processing your job, see below.
>                    Post job file processing error; job 18.master on host par1/0
> 
>                    Unable to copy file /var/spool/torque/spool/18.master.OU to
>                    kuang at master:/home/kuang/sharpbend/s1/r8/r8.o18
>                    *** error from copy
>                    Permission denied (publickey,gssapi-with-mic,password).
>                    lost connection
>                    *** end error output
>                    Output retained on that host in:
>                    /var/spool/torque/undelivered/18.master.OU
> 
>                    Unable to copy file /var/spool/torque/spool/18.master.ER to
>                    kuang at master:/home/kuang/sharpbend/s1/r8/r8.e18
>                    *** error from copy
>                    Permission denied (publickey,gssapi-with-mic,password).
>                    lost connection
>                    *** end error output
>                    Output retained on that host in:
>                    /var/spool/torque/undelivered/18.master.ER
> 


