[torqueusers] Problem with running jobs requesting multiple nodes

Gus Correa gus at ldeo.columbia.edu
Thu Oct 24 17:10:51 MDT 2013


Hello Jack

Have you tried Maui instead of pbs_sched?

See these threads:
http://www.supercluster.org/pipermail/torqueusers/2013-October/016264.html
http://www.supercluster.org/pipermail/torqueusers/2013-September/016125.html
http://www.supercluster.org/pipermail/torqueusers/2013-September/016072.html
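
If you want to confirm that the scheduler keeps re-dispatching the same job
(as the repeated "Job Run" lines in your pbs_sched log suggest), a rough
sketch like the one below tallies the run attempts per job ID. It assumes the
semicolon-separated log layout shown in your message. The function name
count_job_runs and the log path in the usage comment are placeholders of my
own, not anything that ships with Torque.

```shell
#!/bin/sh
# Sketch: count how many "Job Run" records pbs_sched logged per job ID.
# Assumed record layout (taken from the quoted log excerpt):
#   date time;event code;daemon;class;job id;message
count_job_runs() {
    # Reads the named log files, or stdin when no file is given.
    awk -F';' '
        $4 == "Job" && $6 ~ /^Job Run/ { runs[$5]++ }
        END { for (id in runs) printf "%d %s\n", runs[id], id }
    ' "$@" | sort -rn    # most frequently dispatched job first
}

# Usage (the path is an assumption; adjust it to your install):
#   count_job_runs /var/spool/torque/sched_logs/20131024
```

A job that shows up many times here while still sitting in state Q is being
handed to the server over and over without ever starting, which matches the
start_count reported by qstat -f.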

IHIH
Gus Correa

On 10/24/2013 05:24 PM, Jack Hill wrote:
> Hello,
>
> I'm running torque 4.1.7 (using pbs_sched) on a small (12 node, 768 core)
> cluster. I can submit jobs that request a single node (even multiple
> processors on that node) and they run as expected. However, if I submit a
> job that requests more than one node, it will be queued, but never begins
> to run. Suggestions on troubleshooting steps would be appreciated.
>
> Output of qmgr -c 'p s'
> """
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_available.nodect = 12
> set queue batch resources_available.nodes = 12
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = tranquility-private
> set server acl_hosts += tranquility
> set server managers = root@tranquility
> set server managers += root@tranquility-private
> set server operators = root@tranquility
> set server operators += root@tranquility-private
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = tranquility-torque@physics.unc.edu
> set server query_other_jobs = True
> set server scheduler_iteration = 60
> set server node_check_rate = 120
> set server tcp_timeout = 300
> set server node_pack = True
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server mail_domain = email.unc.edu
> set server keep_completed = 300
> set server next_job_number = 43386
> set server moab_array_compatible = True
> """
>
> Job script:
> """
> #!/bin/sh
> #
> #This is an example script example.sh
> #
> #These commands set up the Torque Environment for your job:
> #PBS -N Primes
> #PBS -l walltime=01:00:00
> #PBS -q batch
> #PBS -M jackhill@unc.edu
> #PBS -m abe
> #PBS -l nodes=2:ppn=2
>
> /home/jackhill/primes.pl
> """
>
> qstat -f:
> """
> Job Id: 43385.tranquility-private
>       Job_Name = Primes
>       Job_Owner = jackhill@tranquility-private
>       job_state = Q
>       queue = batch
>       server = tranquility.physics.unc.edu
>       Checkpoint = u
>       ctime = Thu Oct 24 16:46:18 2013
>       Error_Path = tranquility.physics.unc.edu:/home/jackhill/Primes.e43385
>       exec_port = 15003+15003+15003+15003
>       Hold_Types = n
>       Join_Path = n
>       Keep_Files = n
>       Mail_Points = abe
>       Mail_Users = jackhill+torque@email.unc.edu
>       mtime = Thu Oct 24 17:18:02 2013
>       Output_Path = tranquility.physics.unc.edu:/home/jackhill/Primes.o43385
>       Priority = 0
>       qtime = Thu Oct 24 16:46:18 2013
>       Rerunable = True
>       Resource_List.nodect = 2
>       Resource_List.nodes = 2:ppn=2
>       Resource_List.walltime = 01:00:00
>       Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/jackhill,
>           PBS_O_LOGNAME=jackhill,
>           PBS_O_PATH=/opt/precise/torque/4.1.7/bin:/opt/precise/torque/4.1.7/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:~/bin,
>           PBS_O_MAIL=/var/mail/jackhill,PBS_O_SHELL=/usr/bin/tcsh,
>           PBS_O_LANG=en_US.UTF-8,PBS_O_WORKDIR=/home/jackhill,
>           PBS_O_HOST=tranquility.physics.unc.edu,
>           PBS_O_SERVER=tranquility-private
>       comment = Job started on Thu Oct 24 at 17:17
>       etime = Thu Oct 24 16:46:18 2013
>       exit_status = -3
>       submit_args = example3.sh
>       start_time = Thu Oct 24 17:18:02 2013
>       start_count = 77
>       fault_tolerant = False
>       job_radix = 0
>       submit_host = tranquility.physics.unc.edu
> """
>
> end of pbs_sched log:
> """
> 10/24/2013 17:11:22;0040; pbs_sched.16639;Job;43386.tranquility-private;Job Run
> 10/24/2013 17:11:47;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:12:12;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:12:37;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:13:02;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:13:27;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:13:52;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:14:17;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:14:42;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:15:07;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:15:32;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:15:57;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:16:22;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:16:47;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:17:12;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:17:37;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:18:02;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:18:27;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> 10/24/2013 17:18:52;0040; pbs_sched.16639;Job;43385.tranquility-private;Job Run
> """
>
> end of pbs_server log:
> """
> 10/24/2013 17:17:37;0008;PBS_Server.19579;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:17:37;0040;PBS_Server.19579;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> 10/24/2013 17:18:02;0008;PBS_Server.19590;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:18:02;0040;PBS_Server.19590;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> 10/24/2013 17:18:27;0008;PBS_Server.19758;Job;43385.tranquility-private;Job Modified at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:18:27;0008;PBS_Server.19758;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:18:27;0040;PBS_Server.19758;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> 10/24/2013 17:18:52;0008;PBS_Server.19768;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:18:52;0040;PBS_Server.19768;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> 10/24/2013 17:19:17;0008;PBS_Server.19771;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:19:17;0040;PBS_Server.19771;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> 10/24/2013 17:19:42;0008;PBS_Server.19774;Job;43385.tranquility-private;Job Modified at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:19:42;0008;PBS_Server.19774;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:19:42;0040;PBS_Server.19774;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> 10/24/2013 17:20:07;0008;PBS_Server.19778;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:20:07;0040;PBS_Server.19778;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> 10/24/2013 17:20:32;0008;PBS_Server.19806;Job;43385.tranquility-private;Job Modified at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:20:32;0008;PBS_Server.19806;Job;43385.tranquility-private;Job Run at request of Scheduler@tranquility.physics.unc.edu
> 10/24/2013 17:20:32;0040;PBS_Server.19806;Svr;tranquility.physics.unc.edu;Scheduler was sent the command new
> """
>
> To my untrained eye, it seems like the scheduler is trying to run the job,
> but for some reason it ends up back in the queue, so the scheduler is
> asked to schedule it again.
>
> Thanks,
> Jack


