[torqueusers] Trouble running jobs with TORQUE

nathaniel.x.woody at gsk.com nathaniel.x.woody at gsk.com
Tue Mar 27 09:26:49 MDT 2007


The off-the-cuff answer is that there might be a problem with the rsh/ssh 
permissions on the system.  Have you verified that the user submitting the 
job (administrator at babbage) can do a passwordless ssh (assuming you 
configured with --with-scp) to the compute nodes and back to the headnode?
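
A quick way to check this from the headnode is sketched below.  It is only 
a rough helper, not anything that ships with TORQUE: the function name and 
the SSH_CMD override are made up for illustration, and the node names in 
the usage line are placeholders for your own.  It relies on the OpenSSH 
BatchMode option, which makes ssh fail instead of prompting for a password.

```shell
# Hypothetical helper: report whether passwordless ssh works for each host.
# SSH_CMD is an assumed override hook (handy for dry runs); default is ssh.
check_passwordless_ssh() {
    for host in "$@"; do
        # BatchMode=yes: never prompt, so a missing key shows up as a failure.
        # stdin is redirected so ssh cannot swallow input meant for the caller.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 \
               "$host" true </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}
```

Run it as the submitting user, for example 
check_passwordless_ssh node01 node02, and then run the same check on a 
compute node against the headnode to cover the return direction.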

For the test echo 'hostname' | qsub, are you getting stdout and stderr 
files back (files that look like STDIN.e123456)?  If you are, is there 
anything in them?  Is the administrator getting an email about these jobs 
with any information in it? 
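
If the files do come back, something like the following will dump them all 
in one go.  This is a small sketch, assuming the usual NAME.o<number> and 
NAME.e<number> naming for job output files; the function name is made up.

```shell
# Hypothetical helper: print every job output file found in a directory
# (default: current directory), flagging the empty ones.
show_job_output() {
    dir=${1:-.}
    for f in "$dir"/*.[oe][0-9]*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        if [ -s "$f" ]; then
            echo "== $f"
            cat "$f"
        else
            echo "== $f (empty)"
        fi
    done
}
```

Calling show_job_output with no argument in the directory you submitted 
from should list anything the job wrote; no output at all means no files 
were delivered there.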

A separate issue with python that I have run into: even when python looks 
'all set', you need to make sure PYTHONPATH is set appropriately in the 
shell that torque opens, if you have installed extra packages.  But any 
problems here should show up in a stack trace in the stderr file and can 
be diagnosed that way.
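
One way to see what the job's shell actually gets is to submit a tiny 
script that just prints its environment.  The sketch below writes such a 
script; the function name and the file name envcheck.sh are only examples.

```shell
# Hypothetical helper: write a diagnostic job script to the given path.
# The script reports the node name, PYTHONPATH, and which python is found.
write_envcheck() {
    printf '%s\n' \
        '#!/bin/sh' \
        'echo "node: $(uname -n)"' \
        'echo "PYTHONPATH: ${PYTHONPATH:-<unset>}"' \
        'command -v python || echo "python: not found on PATH"' \
        > "$1"
}
```

After write_envcheck envcheck.sh, compare what sh envcheck.sh prints in 
your login shell with what comes back in the stdout file from 
qsub envcheck.sh.  Any difference in PYTHONPATH points at the shell setup 
rather than at python itself.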

Hope that gives you a start,
Nate





aohara at haverford.edu 
Sent by: torqueusers-bounces at supercluster.org
26-Mar-2007 17:48
 
To
torqueusers at supercluster.org
cc

Subject
[torqueusers] Trouble running jobs with TORQUE






Hi,
We just recently began setting up a Linux cluster here at Haverford
College using TORQUE and Maui.  The general specs are 6 blades with two
dual-core AMD Opterons, 16 GB of RAM, and a head node with a similar
processor setup.
Over the past week, we installed TORQUE (and Maui); however, TORQUE seems
to be having trouble running jobs.
Running 'pbsnodes -a' reports correctly on the state of all nodes, and if
neither pbs_sched nor Maui is running then qstat shows jobs labeled Q, as
expected.  However, when either pbs_sched or Maui is running, the jobs
don't seem to run properly.  I tried submitting both the test
phrase `echo "sleep 30" | qsub' and a script `qsub testjob' where testjob
is a script containing `python myprogram.py'.  All necessary python
packages are installed too, so I know this isn't the problem (I've
manually run the python code on all nodes).  The reason I suspect some
form of TORQUE error is that this job also completes immediately, even
though it should take roughly 20 minutes to run.  The tracejob output for
one is here (both are basically the same though):

03/26/2007 17:25:17  S    enqueuing into batch, state 1 hop 1
03/26/2007 17:25:17  S    Job Queued at request of administrator at babbage, owner = administrator at babbage, job name = testjob.sh, queue = batch
03/26/2007 17:25:18  S    Job Modified at request of root at babbage
03/26/2007 17:25:18  S    Job Run at request of root at babbage
03/26/2007 17:25:18  S    Job Modified at request of root at babbage
03/26/2007 17:25:18  S    Exit_status=-1
03/26/2007 17:25:18  S    Post job file processing error
03/26/2007 17:25:18  S    dequeuing from batch, state COMPLETE
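
When comparing several jobs, it can help to pull just the exit status out 
of a trace like the one above.  A minimal sketch, assuming the 
Exit_status=<number> form shown in that output; the function name is made 
up for illustration.

```shell
# Hypothetical helper: read tracejob output on stdin and print the value
# of each Exit_status field it finds (the value may be negative).
exit_status_from_tracejob() {
    sed -n 's/.*Exit_status=\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}
```

Typical use would be to pipe tracejob for a given job id into it.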

Any help would be greatly appreciated, thanks.  If you need any more
information about our cluster hardware/software setup just ask.

Thanks,
Andy O'Hara
Haverford College Physics '09
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

