[torqueusers] Re: Re: Trouble running jobs with TORQUE

Donald Tripp dtripp at hawaii.edu
Tue Mar 27 20:15:42 MDT 2007

All users who run jobs need password-less logins. The easiest
way to do this is to set up ssh key pairs across a shared home directory.
They don't need to be able to log in all the time; pbs_server can
modify the access.conf file if necessary. So, here's a
hypothetical situation:

exportfs /home --> compute_nodes/home

cd /home/user
ssh-keygen -t rsa ...
cat .ssh/authorized_keys
   xxxxxxx user at admin

cd /home/user/job
qsub job_file

When the job gets queued, the scheduler will ssh to the compute nodes
from the user's account:

ssh compute01

A login with no prompt is what it should get. Remember, it's in batch
mode, so it can't enter a password.
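
A rough sketch of the key setup and a batch-mode login check, in case it
helps. The function names and the host name compute01 are placeholders,
not anything TORQUE provides, and the sketch assumes OpenSSH and a home
directory shared with the compute nodes:

```shell
#!/bin/sh
# Sketch only: function names and host names are hypothetical.

# Append a public key to authorized_keys under the given home directory,
# creating .ssh with the strict permissions sshd normally expects.
# Note: this appends every time it is called; it does not de-duplicate.
install_pubkey() {
    home_dir=$1
    pubkey_file=$2
    mkdir -p "$home_dir/.ssh" || return 1
    chmod 700 "$home_dir/.ssh" || return 1
    cat "$pubkey_file" >> "$home_dir/.ssh/authorized_keys" || return 1
    chmod 600 "$home_dir/.ssh/authorized_keys"
}

# Return 0 only if a non-interactive login to the given host works.
# BatchMode=yes makes ssh fail instead of prompting for a password,
# which is the situation a batch job is in.
can_login_batch() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

As the job-submitting user, something like this would exercise it
(assuming the key does not exist yet and an empty passphrase is
acceptable on your site):

    ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
    install_pubkey "$HOME" "$HOME/.ssh/id_rsa.pub"
    can_login_batch compute01 && echo "batch login ok"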

- Donald Tripp
  dtripp at hawaii.edu
HPC Systems Administrator
High Performance Computing Center
University of Hawai'i at Hilo
200 W. Kawili Street
Hilo,   Hawaii   96720

On Mar 27, 2007, at 3:57 PM, aohara at haverford.edu wrote:

> Thanks for responding.  I also got a private mail response asking a few
> questions about the setup, so I'll answer that here too.
> In an e-mail from Thomas Pierce, he suggested not using pbs_sched at all
> and solely using Maui for the scheduler.  I just wanted to clarify that
> this is what we are doing/plan to do.  I just tried it out with both to
> isolate the problem as TORQUE- or Maui-specific.  Also, the version of
> TORQUE is 2.1.8.
> I did configure with the '--with-scp' option; however, I was under the
> impression that the root account, which is running/operating TORQUE and
> Maui, was the only one that needed passwordless ssh.  Under the current
> compiled configuration of TORQUE, would I then need to add all users to
> each compute node and then set up ssh for them?  Is there a configure
> option at compile time that would let me keep it so only the root user
> has passwordless ssh set up to the compute nodes?
> After running echo 'hostname' | qsub, no files were output either.
> Thanks again,
> Andy
> P.S. Is there an easier way to reply to a message, since I'm getting the
> digests?  Thanks.
> Date: Tue, 27 Mar 2007 11:26:49 -0400
> From: nathaniel.x.woody at gsk.com
> Subject: Re: [torqueusers] Trouble running jobs with TORQUE
> To: torqueusers at supercluster.org
> Message-ID: <OFD4E27091.803C924B-ON852572AB.00542660-852572AB.0054DF22 at gsk.com>
> Content-Type: text/plain; charset="us-ascii"
> The off-the-cuff answer is that there might be a problem with the rsh/ssh
> permissions on the system.  Have you verified that the user submitting the
> job (administrator at babbage) can do a passwordless ssh (assuming you
> configured with --with-scp) to the compute nodes and back to the head node?
> For the test echo 'hostname' | qsub, are you getting stdout and stderr
> files back (STDIN.e123456-looking things)?  If you are, is there anything
> in them?  Is the administrator getting an email about these jobs with any
> information in it?
> A separate issue with python that I have run into is ensuring that the
> 'all set' python setup includes PYTHONPATH being set appropriately in the
> shell that torque opens, if you have installed extra packages.  But any
> problems here should show up in a stack trace in the stderr file and can
> be diagnosed that way.
> Hope that gives you a start,
> Nate
> From: aohara at haverford.edu
> Sent by: torqueusers-bounces at supercluster.org
> Date: 26-Mar-2007 17:48
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Trouble running jobs with TORQUE
> Hi,
> We just recently began setting up a Linux cluster here at Haverford
> College using TORQUE and Maui.  The general specs are 6 blades with two
> dual-core AMD Opterons, 16 GB RAM, and a head node with a similar
> processor setup.
> Over the past week, we installed TORQUE (and Maui); however, TORQUE seems
> to be having trouble running jobs.
> Running 'pbsnodes -a' reports correctly on the state of all nodes, and if
> neither pbs_sched nor Maui is running then qstat shows jobs labeled Q, as
> expected.  However, when either pbs_sched or Maui is running, the jobs
> don't seem to run properly.  I tried submitting both the test
> phrase `echo "sleep 30" | qsub' and a script `qsub testjob' where testjob
> is a script containing `python myprogram.py'.  All necessary python
> packages are installed too, so I know this isn't the problem (I've
> manually run the python code on all nodes).  The reason I suspect some
> form of TORQUE error is that this job also completes immediately, even
> though it should take roughly 20 minutes to run.  The tracejob output for
> one is here (both are basically the same though):
> 03/26/2007 17:25:17  S    enqueuing into batch, state 1 hop 1
> 03/26/2007 17:25:17  S    Job Queued at request of administrator at babbage, owner
>                           = administrator at babbage, job name = testjob.sh, queue
>                           = batch
> 03/26/2007 17:25:18  S    Job Modified at request of root at babbage
> 03/26/2007 17:25:18  S    Job Run at request of root at babbage
> 03/26/2007 17:25:18  S    Job Modified at request of root at babbage
> 03/26/2007 17:25:18  S    Exit_status=-1
> 03/26/2007 17:25:18  S    Post job file processing error
> 03/26/2007 17:25:18  S    dequeuing from batch, state COMPLETE
> Any help would be greatly appreciated, thanks.  If you need any more
> information about our cluster hardware/software setup, just ask.
> Thanks,
> Andy O'Hara
> Haverford College Physics '09
