[torqueusers] Trouble running jobs with TORQUE

Garrick Staples garrick at clusterresources.com
Wed Mar 28 16:00:14 MDT 2007


On Mon, Mar 26, 2007 at 05:48:45PM -0400, aohara at haverford.edu alleged:
> Hi,
> We just recently began setting up a Linux cluster here at Haverford
> College using TORQUE and Maui.  The general specs are 6 blades with two
> dual-core AMD Opterons and 16 GB RAM, and a head node with a similar
> processor setup.
> Over the past week, we installed TORQUE (and Maui), but TORQUE seems
> to be having trouble running jobs.
> Running 'pbsnodes -a' reports correctly on the state of all nodes, and if
> neither pbs_sched nor Maui is running then qstat shows jobs labeled Q, as
> expected.  However, when either pbs_sched or Maui is running, the jobs
> don't seem to run properly.  I tried submitting both the test
> phrase `echo "sleep 30" | qsub' and a script `qsub testjob' where testjob
> is a script containing `python myprogram.py'.  All necessary Python
> packages are installed too, so I know this isn't the problem (I've
> manually run the Python code on all nodes).  The reason I suspect some
> form of TORQUE error is that this job also completes immediately, even
> though it should take roughly 20 minutes to run.  The tracejob output for
> one is here (both are basically the same though):
> 
> 03/26/2007 17:25:17  S    enqueuing into batch, state 1 hop 1
> 03/26/2007 17:25:17  S    Job Queued at request of administrator at babbage,
>                           owner = administrator at babbage,
>                           job name = testjob.sh, queue = batch
> 03/26/2007 17:25:18  S    Job Modified at request of root at babbage
> 03/26/2007 17:25:18  S    Job Run at request of root at babbage
> 03/26/2007 17:25:18  S    Job Modified at request of root at babbage
> 03/26/2007 17:25:18  S    Exit_status=-1
> 03/26/2007 17:25:18  S    Post job file processing error
> 03/26/2007 17:25:18  S    dequeuing from batch, state COMPLETE
> 
> Any help would be greatly appreciated, thanks.  If you need any more
> information about our cluster hardware/software setup just ask.

Have you checked syslog?  A "Post job file processing error" means that
the final stage of launching the job failed, when the job's child process
is actually being set up.

There should be errors in the syslog of the execution node, or in the STDERR
output of the job (which could be sitting in the undelivered directory of
the execution node if your output copying isn't configured yet).
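
As a rough sketch of that check (not part of TORQUE itself), something like
the following Python could be run on the execution node.  The function names
are hypothetical, and the paths mentioned in the comments (/var/log/messages
for syslog, /var/spool/torque for the TORQUE home directory) are assumptions
that depend on your distribution and how TORQUE was configured.

```python
import os


def mom_syslog_lines(syslog_path, tag="pbs_mom"):
    """Return the lines of a syslog file that mention `tag`.

    `tag` defaults to "pbs_mom", the name of the TORQUE execution daemon.
    Assumed syslog locations: /var/log/messages or /var/log/syslog.
    """
    with open(syslog_path, errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if tag in line]


def undelivered_files(torque_home):
    """Return the sorted file names in <torque_home>/undelivered.

    Returns an empty list if the directory does not exist.
    Assumed TORQUE home: /var/spool/torque (varies by install).
    """
    path = os.path.join(torque_home, "undelivered")
    if not os.path.isdir(path):
        return []
    return sorted(os.listdir(path))
```

For example, print the result of mom_syslog_lines("/var/log/messages") and
undelivered_files("/var/spool/torque") as root on the node that ran the job,
then read any files that turn up in the undelivered directory.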


