[torqueusers] Re: Re: Trouble running jobs with TORQUE
Donald Tripp
dtripp at hawaii.edu
Tue Mar 27 20:15:42 MDT 2007
All users who run jobs need to have password-less logins. The easiest
way to do this is to set up ssh key pairs in a shared home directory.
They don't need to be able to log in all the time; pbs_server can
modify the access.conf file if necessary. So, here's a hypothetical
situation:
exportfs /home --> compute_nodes/home
cd /home/user
ssh-keygen -t rsa ...
cat .ssh/authorized_keys
xxxxxxx user at admin
cd /home/user/job
qsub job_file
When the job gets queued, the scheduler will ssh to the compute nodes
from the user's account:
ssh compute01
compute01$
is what it should get. Remember, it's in batch mode, so it can't
enter a password.
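As a rough sketch of the key setup described above, assuming OpenSSH and a home directory that is shared with the compute nodes, something like the following could be run once per user. The helper names setup_batch_ssh_key and check_batch_login are made up for illustration and are not part of TORQUE. BatchMode=yes makes ssh fail instead of prompting, which is a reasonable stand-in for what a batch job sees.

```shell
#!/bin/sh
# Sketch only: password-less key pair for one user whose home directory is
# shared with the compute nodes. Both functions are hypothetical helpers.

setup_batch_ssh_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1

    # Generate an RSA key with an empty passphrase unless one already exists.
    if [ ! -f "$ssh_dir/id_rsa" ]; then
        ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa" || return 1
    fi

    # Authorize the public key once; with a shared home directory the same
    # authorized_keys file is then visible on every node.
    touch "$ssh_dir/authorized_keys"
    if ! grep -qxF -- "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys"; then
        cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}

# Succeeds only if the node lets us in without asking for a password.
check_batch_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example, `setup_batch_ssh_key && check_batch_login compute01` should return success for the submitting user; if it fails, the job would not be able to log in either.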
- Donald Tripp
dtripp at hawaii.edu
----------------------------------------------
HPC Systems Administrator
High Performance Computing Center
University of Hawai'i at Hilo
200 W. Kawili Street
Hilo, Hawaii 96720
http://www.hpc.uhh.hawaii.edu
On Mar 27, 2007, at 3:57 PM, aohara at haverford.edu wrote:
> Thanks for responding. I also got a private mail response asking a few
> questions about the setup, so I'll answer those here too.
>
> In an e-mail from Thomas Pierce, he suggested not using pbs_sched at all
> and solely using Maui for the scheduler. I just wanted to clarify that
> this is what we are doing/plan to do. I only tried it out with both to
> isolate the problem as TORQUE- or Maui-specific. Also, the version of
> TORQUE is 2.1.8.
>
> I did configure with the '--with-scp' option; however, I was under the
> impression that the root account, which is running/operating TORQUE and
> Maui, was the only one that needed to have passwordless ssh. Under the
> current compiled configuration of TORQUE, would I then need to add all
> users to each compute node and then set up ssh for them? Is there a
> compile-time configuration option that lets me keep it so only the root
> user has passwordless ssh set up to the compute nodes?
>
> After running echo 'hostname' | qsub, no output files were produced either.
>
> Thanks again,
> Andy
>
> P.S. Is there an easier way to reply to a message, since I'm getting
> the digests? Thanks.
>
> Date: Tue, 27 Mar 2007 11:26:49 -0400
> From: nathaniel.x.woody at gsk.com
> Subject: Re: [torqueusers] Trouble running jobs with TORQUE
> To: torqueusers at supercluster.org
> Message-ID: <OFD4E27091.803C924B-ON852572AB.00542660-852572AB.0054DF22 at gsk.com>
> Content-Type: text/plain; charset="us-ascii"
>
> The off-the-cuff answer is that there might be a problem with the
> rsh/ssh permissions on the system. Have you verified that the user
> submitting the job (administrator at babbage) can do a passwordless ssh
> (assuming you configured with --with-scp) to the compute nodes and back
> to the head node?
>
> For the test echo 'hostname' | qsub, are you getting stdout and stderr
> files back (things that look like STDIN.e123456)? If you are, is there
> anything in them? Is the administrator getting an email about these
> jobs with any information in it?
>
> A separate issue with Python that I have run into is ensuring that the
> 'all set' Python setup includes PYTHONPATH being set appropriately in
> the shell that TORQUE opens, if you have installed extra packages. But
> any problems here should show up in a stack trace in the stderr file
> and can be diagnosed that way.
>
> Hope that gives you a start,
> Nate
>
> aohara at haverford.edu
> Sent by: torqueusers-bounces at supercluster.org
> 26-Mar-2007 17:48
>
> To: torqueusers at supercluster.org
> cc:
> Subject: [torqueusers] Trouble running jobs with TORQUE
>
> Hi,
> We recently began setting up a Linux cluster here at Haverford College
> using TORQUE and Maui. The general specs are 6 blades with two
> dual-core AMD Opterons, 16 GB RAM, and a head node with a similar
> processor setup.
> Over the past week, we installed TORQUE (and Maui); however, TORQUE
> seems to be having trouble running jobs.
> Running 'pbsnodes -a' reports correctly on the state of all nodes, and
> if neither pbs_sched nor Maui is running then qstat shows jobs labeled
> Q, as expected. However, when either pbs_sched or Maui is running, the
> jobs don't seem to run properly. I tried submitting both the test
> phrase `echo "sleep 30" | qsub' and a script `qsub testjob', where
> testjob is a script containing `python myprogram.py'. All necessary
> Python packages are installed too, so I know this isn't the problem
> (I've manually run the Python code on all nodes). The reason I suspect
> some form of TORQUE error is that this job also completes immediately,
> even though it should take roughly 20 minutes to run. The tracejob
> output for one is here (both are basically the same though):
>
> 03/26/2007 17:25:17 S enqueuing into batch, state 1 hop 1
> 03/26/2007 17:25:17 S Job Queued at request of administrator at babbage,
>                       owner = administrator at babbage,
>                       job name = testjob.sh, queue = batch
> 03/26/2007 17:25:18 S Job Modified at request of root at babbage
> 03/26/2007 17:25:18 S Job Run at request of root at babbage
> 03/26/2007 17:25:18 S Job Modified at request of root at babbage
> 03/26/2007 17:25:18 S Exit_status=-1
> 03/26/2007 17:25:18 S Post job file processing error
> 03/26/2007 17:25:18 S dequeuing from batch, state COMPLETE
>
> Any help would be greatly appreciated, thanks. If you need any more
> information about our cluster hardware/software setup, just ask.
>
> Thanks,
> Andy O'Hara
> Haverford College Physics '09
>