[torqueusers] Jobs in queue do not get started

vlad at cosy.sbg.ac.at vlad at cosy.sbg.ac.at
Mon Oct 3 07:27:40 MDT 2011


We have set up a queue and are able to submit MPI jobs, which get
processed as long as all the resources are free at the start.

When we then submit more jobs than the system can process at once,
the extra jobs get queued (which is good, and the whole purpose), but
they never leave the queue and are never processed.
In the meantime the nodes become free and sit idle, twiddling their thumbs.

As I already wrote in previous mails, we are running a Torque 3.0.3 snapshot
and Maui version 3.3.1.

So when "qstat" shows me that job #XYZ is in state "Q", it stays that way
until the end of time...

tracejob shows the activity of the jobs, but it reveals no error messages.

Eventually I deleted them with qdel, and they were removed.
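
For anyone who wants to look at the same state on their own system, the
checks mentioned above can be bundled roughly as follows. This is only a
sketch: inspect_job is a made-up helper name, the job id is a placeholder,
and it assumes the Torque client tools (qstat, tracejob, pbsnodes) and
Maui's checkjob and showq are in the PATH. Tools that are not found are
skipped.

```shell
# inspect_job is a hypothetical helper, not part of Torque or Maui.
# Usage: inspect_job <jobid>
inspect_job() {
    jobid=$1
    # qstat -f    : full job attributes as seen by pbs_server
    # tracejob    : log entries recorded for the job
    # checkjob    : Maui's view of the job
    # showq       : Maui's active/idle/blocked job lists
    # pbsnodes -a : state of every node known to pbs_server
    for cmd in "qstat -f $jobid" "tracejob $jobid" "checkjob $jobid" "showq" "pbsnodes -a"; do
        tool=${cmd%% *}            # first word is the program name
        echo "== $cmd"
        if command -v "$tool" >/dev/null 2>&1; then
            $cmd || echo "($tool exited with status $?)"
        else
            echo "(skipped: $tool not found in PATH)"
        fi
    done
}
```

It would be called as "inspect_job XYZ", with XYZ replaced by the id that
qstat reports for the stuck job.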

Our pbs_server configuration:

qmgr -c 'p s'
# Create queues and set their attributes.
# Create and define queue gpushort
create queue gpushort
set queue gpushort queue_type = Execution
set queue gpushort resources_min.nodes = 1
set queue gpushort resources_default.neednodes = gpunode
set queue gpushort resources_default.nodes = 1
set queue gpushort resources_default.walltime = 24:00:00
set queue gpushort enabled = True
set queue gpushort started = True
(more queues to follow, all unused at the moment ..)
(pbs_server configuration:)
# Set server attributes.
set server scheduling = True
set server acl_hosts = gpu
set server managers = forsthof@gpu
set server managers += peter@gpu
set server managers += root@gpu
set server managers += vlad@gpu
set server operators = forsthof@gpu
set server operators += peter@gpu
set server operators += root@gpu
set server operators += vlad@gpu
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 7
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 247

Our submit script:

#This is an example script example.sh
#These commands set up the Grid Environment for your job:
#PBS -l nodes=7

#PBS -q gpushort
#PBS -m abe

# number of slots assigned to this job (not used further below)
np=$(wc -l < "$PBS_NODEFILE")
#print the time and date
date >> /tmp/start.txt
echo "/usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE 
r/queuing/cpi" >> /tmp/start.txt

/usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -n 28 --hostfile "$PBS_NODEFILE" \
    /home/user/queuing/integrate_queued 100000000 100000

Funny: even though only 7 nodes are requested, the 28 processes start
fine until the resources are exhausted. After that, jobs get queued
forever and are never started.
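
The mismatch between the 7 requested nodes and the 28 launched processes
can be made explicit with a small check. This is only a sketch: it assumes
that "-l nodes=7" without a ppn value yields 7 lines in $PBS_NODEFILE, and
count_slots and the temporary node file are made-up stand-ins that are not
part of our script.

```shell
# count_slots is a hypothetical helper: it prints the number of lines
# (one per assigned slot) in a node file such as $PBS_NODEFILE.
count_slots() {
    # arithmetic expansion strips the padding some wc implementations add
    echo $(( $(wc -l < "$1") ))
}

# Stand-in for $PBS_NODEFILE (assumption: 7 lines for "-l nodes=7").
nodefile=$(mktemp)
for i in 1 2 3 4 5 6 7; do
    echo "node$i"
done > "$nodefile"

np=$(count_slots "$nodefile")
launched=28   # the value hard-coded after "mpirun -n" in the submit script

echo "slots assigned: $np, processes launched: $launched"
if [ "$launched" -gt "$np" ]; then
    echo "oversubscribed by $(( launched - np )) processes"
fi

rm -f "$nodefile"
```

If the process count is meant to follow the assignment, passing -n "$np"
to mpirun instead of a fixed 28 would be one way to keep the two in step.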

Any clues?


Vlad Popa
