[torqueusers] Jobs in queue do not get started
vlad at cosy.sbg.ac.at
Mon Oct 3 07:27:40 MDT 2011
We have set up a queue and are able to submit MPI jobs, which are
processed as long as all resources are free at the start.
When we then submit more jobs than the system can process at once,
the extra jobs are queued (which is good, and the purpose of all this), but
they never leave the queue and are never processed.
Meanwhile the nodes become free and sit idle, twiddling their thumbs.
As I wrote in previous mails, we are running a Torque 3.0.3 snapshot and
Maui version 3.3.1.
So when "qstat" shows job #XYZ in state "Q", it stays that way until
the end of time.
tracejob shows the jobs' activity, but it reveals no error messages.
Eventually I deleted the jobs with qdel, and they were removed.
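For anyone wanting to repeat these checks, here is a rough sketch that collects the server-side and scheduler-side views of one queued job. qstat, tracejob and pbsnodes ship with Torque; checkjob is Maui's per-job diagnostic and is an addition here, not something used above. The function name inspect_job and the job ID 246 are placeholders.

```shell
#!/bin/sh
# Sketch: print what Torque and Maui report about a single job.
# inspect_job is a hypothetical helper name; pass the job ID as the first argument.
inspect_job() {
    jobid=$1
    for cmd in "qstat -f $jobid" "tracejob $jobid" "checkjob -v $jobid" "pbsnodes -a"; do
        tool=${cmd%% *}          # first word of the command line
        echo "== $cmd"
        if command -v "$tool" >/dev/null 2>&1; then
            $cmd 2>&1            # run the tool, keeping its error output
        else
            echo "($tool not found in PATH)"
        fi
    done
}

# Example call with a placeholder job ID:
# inspect_job 246
```

Tools that are not installed on the machine are only reported as missing, so the sketch can be run on a submit host as well as on the server.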
Our pbs_server configuration:
qmgr -c 'p s'
# Create queues and set their attributes.
# Create and define queue gpushort
create queue gpushort
set queue gpushort queue_type = Execution
set queue gpushort resources_min.nodes = 1
set queue gpushort resources_default.neednodes = gpunode
set queue gpushort resources_default.nodes = 1
set queue gpushort resources_default.walltime = 24:00:00
set queue gpushort enabled = True
set queue gpushort started = True
(more queues to follow, all unused at the moment ..)
# Set server attributes.
set server scheduling = True
set server acl_hosts = gpu
set server managers = forsthof@gpu
set server managers += peter@gpu
set server managers += root@gpu
set server managers += vlad@gpu
set server operators = forsthof@gpu
set server operators += peter@gpu
set server operators += root@gpu
set server operators += vlad@gpu
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 7
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 247
Our submit script:
#This is an example script example.sh
#These commands set up the Grid Environment for your job:
#PBS -N CPI
#PBS -l nodes=7
#PBS -q gpushort
#PBS -m abe
np=$(cat $PBS_NODEFILE | wc -l)
#print the time and date
date >> /tmp/start.txt
echo "/usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE
r/queuing/cpi" >> /tmp/start.txt
/usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -n 28 --hostfile $PBS_NODEFILE \
    /home/user/queuing/integrate_queued 100000000 100000
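Note that the script computes np but then passes a fixed -n 28 to mpirun. A minimal sketch of using the computed value instead, assuming the intent is one MPI process per slot listed in $PBS_NODEFILE (Torque writes one line per allocated slot). count_slots is a hypothetical helper name; the paths are the ones from the script.

```shell
#!/bin/sh
# Sketch only: derive the process count from the node file instead of hard-coding 28.
# count_slots is a hypothetical helper, not part of Torque or Open MPI.
count_slots() {
    # Count the lines of the node file; strip padding some wc versions add.
    wc -l < "$1" | tr -d ' '
}

# Only launch inside a Torque job, where PBS_NODEFILE is set and readable.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    np=$(count_slots "$PBS_NODEFILE")
    /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -n "$np" --hostfile "$PBS_NODEFILE" \
        /home/user/queuing/integrate_queued 100000000 100000
fi
```

With this, the number of MPI processes follows whatever the "#PBS -l nodes=..." request actually granted, so the job does not ask mpirun for more processes than Torque accounted for.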
Funny: even though only 7 nodes are requested, the 28 processes
start fine until the resources are exhausted. After that, further jobs are
queued forever and are never started.
Any clues?