[torqueusers] Re: Re: Trouble running jobs with TORQUE

aohara at haverford.edu aohara at haverford.edu
Tue Mar 27 19:57:28 MDT 2007


Thanks for responding.  I also got a private mail response asking a few
questions about the setup, so I'll answer those here too.
In an e-mail, Thomas Pierce suggested not using pbs_sched at all
and using only Maui as the scheduler.  I just wanted to clarify that this is
what we plan to do.  I only tried both schedulers to isolate the
problem as TORQUE- or Maui-specific.  Also, the version of TORQUE is 2.1.8.

I did configure with the '--with-scp' option; however, I was
under the impression that the root account, which is
running TORQUE and Maui, was the only one that needed
passwordless ssh.  Under the currently compiled configuration of TORQUE,
would I then need to add all users to each compute node and set up
ssh for them?  Is there a compile-time configuration option that would
let me keep passwordless ssh set up only for the root user on the compute
nodes?

After running echo 'hostname' | qsub, no output files were produced either.

Thanks again,
Andy

P.S. Is there an easier way to reply to a message, since I'm getting the
digests?  Thanks.

Date: Tue, 27 Mar 2007 11:26:49 -0400
From: nathaniel.x.woody at gsk.com
Subject: Re: [torqueusers] Trouble running jobs with TORQUE
To: torqueusers at supercluster.org
Message-ID:
        <OFD4E27091.803C924B-ON852572AB.00542660-852572AB.0054DF22 at gsk.com>
Content-Type: text/plain; charset="us-ascii"

The off-the-cuff answer is that there might be a problem with the rsh/ssh
permissions on the system.  Have you verified that the user submitting the
job (administrator at babbage) can do a passwordless ssh (assuming you
configured with --with-scp) to the compute nodes and back to the head node?
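
A minimal sketch of that check, to be run as the job-submitting user rather
than as root.  The node names in the comments are hypothetical, and SSH_CMD
is a hypothetical override variable added here only so the functions can be
tried without a real cluster; by default the ordinary ssh client is used.

```shell
#!/bin/sh
# Sketch: verify that the current user can ssh to a host without a password.

check_passwordless_ssh() {
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a non-zero exit status means passwordless login is not working.
    # SSH_CMD is a hypothetical override; it defaults to plain ssh.
    ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

report_ssh() {
    # Print one "host: ok" or "host: FAILED" line per host given.
    for host in "$@"; do
        if check_passwordless_ssh "$host" >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example calls (hypothetical node names).
# On the head node, as the submitting user:
#   report_ssh node01 node02
# On a compute node, back toward the head node:
#   report_ssh babbage
```

Both directions matter for the question above, since the check is about
reaching the compute nodes and getting back to the head node.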

For the test echo 'hostname' | qsub, are you getting stdout and stderr
files back (STDIN.e123456-looking things)?  If you are, is there anything
in them?  Is the administrator getting an email about these jobs with any
information in it?

A separate issue with Python that I have run into is ensuring that the
'all set' Python setup includes PYTHONPATH being set appropriately in the
shell that TORQUE opens, if you have installed extra packages.  But any
problems here should show up in a stack trace in the stderr file and can
be diagnosed that way.
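
One way to see what the batch shell actually has set is to submit a tiny
diagnostic job and compare its output with an interactive login.  This is
only a sketch; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch of a diagnostic job: record the environment the job's shell sees.

print_job_env() {
    echo "host=$(uname -n)"                    # node the job landed on
    echo "user=$(id -un)"                      # account the job runs as
    echo "PATH=$PATH"
    echo "PYTHONPATH=${PYTHONPATH:-<unset>}"   # shows <unset> if empty or missing
}

print_job_env
```

Saved under a hypothetical name such as envcheck.sh and submitted with
qsub envcheck.sh, the lines would land in the job's stdout file, assuming
output files are being delivered at all.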

Hope that gives you a start,
Nate





aohara at haverford.edu
Sent by: torqueusers-bounces at supercluster.org
26-Mar-2007 17:48

To
torqueusers at supercluster.org
cc

Subject
[torqueusers] Trouble running jobs with TORQUE






Hi,
We recently began setting up a Linux cluster here at Haverford
College using TORQUE and Maui.  The general specs are 6 blades with two
dual-core AMD Opterons, 16 GB RAM, and a head node with a similar
processor setup.
Over the past week, we installed TORQUE (and Maui); however, TORQUE seems
to be having trouble running jobs.
Running 'pbsnodes -a' reports correctly on the state of all nodes, and if
neither pbs_sched nor Maui is running, then qstat shows jobs labeled Q, as
expected.  However, when either pbs_sched or Maui is running, the jobs
don't seem to run properly.  I tried submitting both the test
phrase `echo "sleep 30" | qsub' and a script `qsub testjob' where testjob
is a script containing `python myprogram.py'.  All necessary Python
packages are installed too, so I know this isn't the problem (I've
manually run the Python code on all nodes).  The reason I suspect some
form of TORQUE error is that this job also completes immediately, even though
it should take roughly 20 minutes to run.  The tracejob output for one is
here (both are basically the same though):

03/26/2007 17:25:17  S    enqueuing into batch, state 1 hop 1
03/26/2007 17:25:17  S    Job Queued at request of administrator at babbage, owner
                          = administrator at babbage, job name = testjob.sh, queue
                          = batch
03/26/2007 17:25:18  S    Job Modified at request of root at babbage
03/26/2007 17:25:18  S    Job Run at request of root at babbage
03/26/2007 17:25:18  S    Job Modified at request of root at babbage
03/26/2007 17:25:18  S    Exit_status=-1
03/26/2007 17:25:18  S    Post job file processing error
03/26/2007 17:25:18  S    dequeuing from batch, state COMPLETE
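
When comparing several jobs like this one, the exit status can be pulled
out of tracejob-style text with a small filter.  This is a sketch; the
function name is made up, and it relies only on the "Exit_status=" field
as it appears in the output above.

```shell
#!/bin/sh
# Sketch: read tracejob-style text on stdin and print the job's exit status.

job_exit_status() {
    # Keep only the (possibly negative) number after "Exit_status=";
    # if the field appears more than once, report the last occurrence.
    sed -n 's/.*Exit_status=\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Example, with the job id left as a placeholder:
#   tracejob <jobid> | job_exit_status
```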

Any help would be greatly appreciated, thanks.  If you need any more
information about our cluster hardware/software setup, just ask.

Thanks,
Andy O'Hara
Haverford College Physics '09


