[torqueusers] Problem with running jobs requesting multiple nodes

Jack Hill jackhill at unc.edu
Thu Oct 24 15:24:17 MDT 2013


Hello,

I'm running Torque 4.1.7 (using pbs_sched) on a small (12 node, 768 core)
cluster. I can submit jobs that request a single node (even multiple
processors on that node) and they run as expected. However, if I submit a
job that requests more than one node, it is queued but never begins to
run. Suggestions for troubleshooting steps would be appreciated.

Output of qmgr -c 'p s'
"""
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_available.nodect = 12
set queue batch resources_available.nodes = 12
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = tranquility-private
set server acl_hosts += tranquility
set server managers = root@tranquility
set server managers += root@tranquility-private
set server operators = root@tranquility
set server operators += root@tranquility-private
set server default_queue = batch
set server log_events = 511
set server mail_from = tranquility-torque at physics.unc.edu
set server query_other_jobs = True
set server scheduler_iteration = 60
set server node_check_rate = 120
set server tcp_timeout = 300
set server node_pack = True
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server mail_domain = email.unc.edu
set server keep_completed = 300
set server next_job_number = 43386
set server moab_array_compatible = True
"""

Job script:
"""
#!/bin/sh
#
#This is an example script example.sh
#
#These commands set up the Torque Environment for your job:
#PBS -N Primes
#PBS -l walltime=01:00:00
#PBS -q batch
#PBS -M jackhill at unc.edu
#PBS -m abe
#PBS -l nodes=2:ppn=2

/home/jackhill/primes.pl
"""

qstat -f:
"""
Job Id: 43385.tranquility-private
     Job_Name = Primes
     Job_Owner = jackhill at tranquility-private
     job_state = Q
     queue = batch
     server = tranquility.physics.unc.edu
     Checkpoint = u
     ctime = Thu Oct 24 16:46:18 2013
     Error_Path = tranquility.physics.unc.edu:/home/jackhill/Primes.e43385
     exec_port = 15003+15003+15003+15003
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = abe
     Mail_Users = jackhill+torque at email.unc.edu
     mtime = Thu Oct 24 17:18:02 2013
     Output_Path = tranquility.physics.unc.edu:/home/jackhill/Primes.o43385
     Priority = 0
     qtime = Thu Oct 24 16:46:18 2013
     Rerunable = True
     Resource_List.nodect = 2
     Resource_List.nodes = 2:ppn=2
     Resource_List.walltime = 01:00:00
     Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/jackhill,
         PBS_O_LOGNAME=jackhill,
         PBS_O_PATH=/opt/precise/torque/4.1.7/bin:/opt/precise/torque/4.1.7/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:~/bin,
         PBS_O_MAIL=/var/mail/jackhill,PBS_O_SHELL=/usr/bin/tcsh,
         PBS_O_LANG=en_US.UTF-8,PBS_O_WORKDIR=/home/jackhill,
         PBS_O_HOST=tranquility.physics.unc.edu,
         PBS_O_SERVER=tranquility-private
     comment = Job started on Thu Oct 24 at 17:17
     etime = Thu Oct 24 16:46:18 2013
     exit_status = -3
     submit_args = example3.sh
     start_time = Thu Oct 24 17:18:02 2013
     start_count = 77
     fault_tolerant = False
     job_radix = 0
     submit_host = tranquility.physics.unc.edu
"""

end of pbs_sched log:
"""
10/24/2013 17:11:22;0040; pbs_sched.16639;Job;43386.tranquility-private;Job Run
10/24/2013 17:11:47;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:12:12;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:12:37;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:13:02;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:13:27;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:13:52;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:14:17;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:14:42;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:15:07;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:15:32;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:15:57;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:16:22;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:16:47;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:17:12;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:17:37;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:18:02;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:18:27;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
10/24/2013 17:18:52;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
"""

end of pbs_server log:
"""
10/24/2013 17:17:37;0008;PBS_Server.19579;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:17:37;0040;PBS_Server.19579;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
10/24/2013 17:18:02;0008;PBS_Server.19590;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:18:02;0040;PBS_Server.19590;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
10/24/2013 17:18:27;0008;PBS_Server.19758;Job;43385.tranquility-private;Job Modified at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:18:27;0008;PBS_Server.19758;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:18:27;0040;PBS_Server.19758;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
10/24/2013 17:18:52;0008;PBS_Server.19768;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:18:52;0040;PBS_Server.19768;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
10/24/2013 17:19:17;0008;PBS_Server.19771;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:19:17;0040;PBS_Server.19771;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
10/24/2013 17:19:42;0008;PBS_Server.19774;Job;43385.tranquility-private;Job Modified at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:19:42;0008;PBS_Server.19774;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:19:42;0040;PBS_Server.19774;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
10/24/2013 17:20:07;0008;PBS_Server.19778;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:20:07;0040;PBS_Server.19778;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
10/24/2013 17:20:32;0008;PBS_Server.19806;Job;43385.tranquility-private;Job Modified at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:20:32;0008;PBS_Server.19806;Job;43385.tranquility-private;Job Run at request of Scheduler at tranquility.physics.unc.edu
10/24/2013 17:20:32;0040;PBS_Server.19806;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
"""

To my untrained eye, it seems like the scheduler is trying to run the job,
but for some reason the job ends up back in the queue, so the scheduler is
asked to schedule it again on the next cycle.
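
One way to make that pattern visible is to count how often the scheduler
logs "Job Run" for each job. The following is only a minimal sketch: it
assumes the log lines are semicolon-separated exactly as in the pbs_sched
excerpt above, and count_job_runs is a made-up helper name, not a Torque
command.

```shell
#!/bin/sh
# Count "Job Run" entries per job ID in a pbs_sched log.
# Assumed field layout (taken from the excerpt above):
#   timestamp;event code;daemon;object type;job ID;message
# count_job_runs is a hypothetical helper, not part of Torque.
count_job_runs() {
    awk -F';' '
        # Keep only job records whose message starts with "Job Run".
        $4 == "Job" && $6 ~ /^Job Run/ { runs[$5]++ }
        # Print "<count> <job ID>" for every job seen.
        END { for (id in runs) print runs[id], id }
    ' "$@" | sort -rn
}
```

Given a log file as an argument (or log lines on stdin), it prints one
"count jobid" line per job, highest count first. On the pbs_sched excerpt
above it reports 18 for 43385.tranquility-private and 1 for
43386.tranquility-private, which matches a job that is started on every
scheduler cycle and never stays running.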

Thanks,
Jack

