[torqueusers] my new pbs server is not working
Gus Correa
gus at ldeo.columbia.edu
Wed Mar 10 16:12:08 MST 2010
Hi Shibo
Sorry, I forgot this important step.
On your master node do this (you may need to do this as root,
or using "su" or "sudo", unless the user shibo is also a Torque
administrator):
qmgr -c "set server allow_node_submit = True"
to allow jobs to be submitted from all nodes,
not only from the master.
To confirm that the server configuration changed,
do:
qmgr -c "print server"
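As a rough sketch of that check, you could filter the "print server" output for the attribute instead of reading it by eye. The helper name node_submit_enabled is only an illustration (it is not a Torque command), and I am assuming the output carries a line of the form "set server allow_node_submit = True", which is how qmgr normally prints server attributes:

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): reads the output of
# `qmgr -c "print server"` on stdin and succeeds only if it contains
# a line setting allow_node_submit to True (case-insensitive).
node_submit_enabled() {
    grep -Eiq '^[[:space:]]*set[[:space:]]+server[[:space:]]+allow_node_submit[[:space:]]*=[[:space:]]*true[[:space:]]*$'
}

# Typical use on the master node (needs qmgr in the PATH):
#   qmgr -c "print server" | node_submit_enabled && echo "node submission enabled"
```

If the function reports failure, the attribute is either missing or set to False, and the qmgr "set server" command above needs to be run again with administrator rights.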
Also:
1) From what you say, it looks like your qsub is in /usr/local/bin/qsub,
not in /var/spool/torque/bin (my wrong guess).
2) There are no torque.sh and torque.csh files in /etc/profile.d.
You would need to *create* them.
However, this may not be necessary, as your Torque qsub command is
installed in /usr/local/bin, which is likely to be in your PATH already.
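If in doubt, the check can be scripted. The sketch below tests whether a directory appears in a colon-separated PATH string; path_contains is a made-up helper name. The example feeds it the PATH value you found in /var/spool/torque/pbs_environment. I am only assuming that this is the PATH your job script ends up with:

```shell
#!/bin/sh
# path_contains DIR PATHSTRING
# Succeeds if DIR is one of the colon-separated entries of PATHSTRING.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example: test the PATH value reported from pbs_environment.
if path_contains /usr/local/bin "/bin:/usr/bin"; then
    echo "qsub directory is in that PATH"
else
    echo "qsub directory is missing from that PATH"
fi
```

If the directory turns out to be missing when the job runs, one way around it (a guess on my part, not tested here) is to call qsub by its full path, /usr/local/bin/qsub, inside the job script.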
I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
shibo kuang wrote:
> Dear Gus,
> Thanks for your reply.
> I am trying to move from Windows to Linux to do simulations, and thus
> I am not familiar with Linux things.
> Resubmission is not working on either the master or the node, although
> the submission works one time for both.
> When I run "which qsub" on master and node, both give "/usr/local/bin/qsub".
> Using export to set the path (export PATH=/usr/local/bin:${PATH}) is
> not working. I cannot find torque.sh, and thus cannot test the second
> method suggested. There is no folder "/var/spool/torque/bin".
> Interestingly, /var/spool/torque/pbs_environment contains
> "PATH=/bin:/usr/bin"
> Thanks again; your further suggestions would be greatly appreciated.
> Cheers,
> Shibo kuang
>
>
>
> On Thu, Mar 11, 2010 at 4:18 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
> Hi Shibo
>
> Glad that your Torque/PBS is now working.
>
> I would guess the problem you have now with job resubmission
> is related to your PATH environment variable.
> Somehow Linux cannot find qsub, and I suppose this happens in the
> slave node.
>
> Does it happen in the master node also?
> What do you get if you login to the slave node and do "which qsub",
> or just "qsub"?
>
> Again, this is not a Torque problem, more of a Sys Admin issue.
> A possible fix may depend a bit on where you installed Torque.
> Assuming it is installed in /var/spool/torque/,
> add /var/spool/torque/bin to your path,
> on your shell initialization script:
>
> For csh/tcsh, in your .cshrc/.tcshrc
>
> setenv PATH /var/spool/torque/bin:${PATH}
>
> For sh/bash in .profile or maybe .bashrc
>
> export PATH=/var/spool/torque/bin:${PATH}
>
> An alternative is to add a torque.sh and a torque.csh file
> to the /etc/profile.d directory *on every node* with the
> contents above.
> (This may depend a bit on which Linux distribution you use.
> It works for Fedora, RedHat, and CentOS, may work for others too.)
>
>
> I hope this helps.
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> shibo kuang wrote:
>
> Hi All,
> Now my PBS server works, with the help of Gus Correa. My
> problem was due to the fact that I did not mount my master folder on
> the nodes. Now I have another problem with automatically restarting
> a job.
> Below is my script:
> #!/bin/bash
> #PBS -N inc90
> #PBS -q short
> #PBS -l walltime=00:08:00
> cd $PBS_O_WORKDIR
> ./nspff >out
> if [ -f jobfinished ]; then
>     rm -f jobfinished
>     exit 0
> fi
> sleep 10
> qsub case
> My code stops at 7 min and is supposed to be restarted
> automatically after 10 s, but it failed with the following error:
> /var/spool/torque/mom_priv/jobs/120.master.SC: line 13: qsub: command not found
>
> Your help would be greatly appreciated.
> Regards,
> Shibo Kuang
>
>
>
> On Wed, Mar 10, 2010 at 2:57 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
> Hi Shibo
>
> Somehow your "slave" computer
> doesn't see /home/kuang/sharpbend/s1/r8,
> although it can be seen by the "master" computer.
> It may be one of several things,
> it is hard to tell exactly with the information you gave,
> but here are some guesses.
>
> Do you really have a separate /home/kuang/sharpbend/s1/r8
> on your "slave" computer, or is it only present in the "master"?
> You can log in to the "slave" and check this directly
> ("ls /home/kuang/sharpbend/s1/r8").
> If the directory is not there,
> this is not really a Torque or MPI problem,
> but a Sys Admin problem with exporting and mounting directories.
>
> If that directory exists only on the master side,
> you can either create an identical copy on the "slave" side (painful),
> or use NFS to export it from the "master" computer to the
> "slave" (easier).
>
> For the second approach, you need to export /home or /home/kuang
> on the "master" computer, and automount it on the "slave" computer.
> The files you need to edit are /etc/exports (master side),
> and /etc/auto.master plus perhaps /etc/auto.home (slave side).
>
> A slightly different approach (not using the automounter)
> is just to hard mount /home or /home/kuang
> on the "slave" side by adding it to the /etc/fstab list.
>
> You also need to turn on the NFS daemon on the "master" node with
> "chkconfig", if it is not yet turned on.
>
> Read the man pages!
> At least read "man exportfs", "man mountd", "man fstab",
> and "man chkconfig".
>
> You may need to reboot the computers for this to take effect.
> Then log in to the "slave" and try again:
> "ls /home/kuang/sharpbend/s1/r8".
>
> I hope this helps.
> Gus Correa
>
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
>
> ---------------------------------------------------------------------
>
> shibo kuang wrote:
>
> "/home/kuang/sharpbend/s1/r8: No such file or directory."
> My node does not have the directory, but my master has it.
>
> On Sun, Mar 7, 2010 at 1:09 AM, shibo kuang <s.b.kuang at gmail.com> wrote:
>
> Hi,
> I just fixed the problem by setting up password-free access between the
> computing node and the master.
> But now I have another problem:
> in r8.e19, it says
> /home/kuang/sharpbend/s1/r8: No such file or directory.
> If only one computer is used, the server works normally.
> What did I miss when I installed Torque?
> Your help would be greatly appreciated.
> Cheers,
> Shibo Kuang
>
>
> On Sun, Mar 7, 2010 at 12:46 AM, shibo kuang <s.b.kuang at gmail.com> wrote:
>
> Hi all,
> I tried to install a PBS server for my two CentOS Linux
> computers (each has 8 cores), but failed.
> Here is my problem:
> if I treat one computer as the master running pbs_server as well
> as a computing node, I can submit jobs using a script without any
> problem. All jobs give the exact results.
> However, when one computer is treated as the master and
> the other as a computing node, jobs are never submitted successfully.
> I would appreciate your hints and suggestions
> based on the following messages I got.
> Regards,
> Shibo Kuang
> Return-Path: <adm at master>
> Received: from master (localhost [127.0.0.1])
>     by master (8.13.1/8.13.1) with ESMTP id o26DwKF9006310
>     for <kuang at master>; Sun, 7 Mar 2010 00:28:20 +1030
> Received: (from root at localhost)
>     by master (8.13.1/8.13.1/Submit) id o26DwKpZ006293
>     for kuang at master; Sun, 7 Mar 2010 00:28:20 +1030
> Date: Sun, 7 Mar 2010 00:28:20 +1030
> From: adm <adm at master>
> Message-Id: <201003061358.o26DwKpZ006293 at master>
> To: kuang at master
> Subject: PBS JOB 18.master
> Precedence: bulk
> PBS Job Id: 18.master
> Job Name: r8
> Exec host: par1/0
> An error has occurred processing your job, see below.
> Post job file processing error; job 18.master on host par1/0
> Unable to copy file /var/spool/torque/spool/18.master.OU to
> kuang at master:/home/kuang/sharpbend/s1/r8/r8.o18
> *** error from copy
> Permission denied (publickey,gssapi-with-mic,password).
> lost connection
> *** end error output
> Output retained on that host in:
> /var/spool/torque/undelivered/18.master.OU
> Unable to copy file /var/spool/torque/spool/18.master.ER to
> kuang at master:/home/kuang/sharpbend/s1/r8/r8.e18
> *** error from copy
> Permission denied (publickey,gssapi-with-mic,password).
> lost connection
> *** end error output
> Output retained on that host in:
> /var/spool/torque/undelivered/18.master.ER
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers