[torqueusers] Two problems with a routing queue
Eirikur.Hjartarson at decode.is
Fri Sep 16 10:08:01 MDT 2011
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> bounces at supercluster.org] On Behalf Of Ken Nielson
> Sent: 16. september 2011 13:07
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Two problems with a routing queue
> > Hi,
> > I'm resubmitting these questions since I got no replies to them one
> > week ago.
> > In order to limit the number of jobs that maui considers for
> > scheduling, we have set up a routing queue:
> > #
> > # Create and define queue exec
> > #
> > create queue exec
> > set queue exec queue_type = Route
> > set queue exec route_destinations = real_exec
> > set queue exec route_held_jobs = False
> > set queue exec enabled = True
> > set queue exec started = True
> > #
> > # Create and define queue real_exec
> > #
> > create queue real_exec
> > set queue real_exec queue_type = Execution
> > set queue real_exec max_user_queuable = 800
> > set queue real_exec from_route_only = True
> > set queue real_exec resources_default.nodes = 1
> > set queue real_exec enabled = True
> > set queue real_exec started = True
> > (800 is a bit higher than the number of CPUs in the cluster)
> > There are two problems that we have experienced with this setup.
> > 1.
> > A job (id: 28379062), that is still on the "exec" queue and depends
> > on another job (id: 28379059) that finishes *before* the job (id:
> > 28379062) is put on the "real_exec" queue will generate the
> > following error mail, when it (id: 28379062) is transferred to the
> > "real_exec" queue.
> > ---
> > PBS Job Id: 28379062.lpbs2.decode.is
> > Job Name: bambino_22892
> > Aborted by PBS Server
> > Dependency request for job rejected by 28379059.lpbs2.decode.is
> > Unknown Job Id Job held for unknown job dep, use 'qrls' to release
> > ---
> > Is there any way to solve this problem, other than setting the
> > keep_completed attribute to some non-zero value? The problem with
> > the keep_completed attribute is that we (think we) have to set it to
> > a big value, say, one day.
> When you set a job dependency TORQUE needs to know which job and
> under what conditions. If there is no record of a job TORQUE does not know
> what to do. Did the job finish correctly? Did it fail?
> Why can you not submit the second job while the first job is still available?
Thanks for your response. I probably did a bad job of explaining the problem.
Both jobs were submitted to the "exec" queue at the same time. The first job (28379059) is then moved to the "real_exec" queue and finishes executing before the second job (28379062) is moved there. The error mail is sent at the moment the second job is moved to the "real_exec" queue.
This problem can be solved by setting the "keep_completed" attribute for the "real_exec" queue to some non-zero value. In our case that value may have to be several hours, and the output from "qstat", for example, then becomes cluttered with information on completed jobs. That is why I am asking whether there is some other solution.
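
For reference, the keep_completed workaround and one way to reduce the qstat clutter might look like the sketch below. The helper names set_keep_completed and hide_completed are made up for illustration, the value is in seconds, and the filter assumes the default qstat layout (two header lines, job state in the fifth column, no spaces in job names).

```shell
# Hypothetical helper: keep completed jobs visible on real_exec for N seconds,
# so that dependent jobs routed later can still find them.
set_keep_completed() {
    qmgr -c "set queue real_exec keep_completed = $1"
}

# Hypothetical helper: drop completed jobs (state "C") from default qstat
# output read on stdin, keeping the two header lines.
# Assumes the state is the fifth whitespace-separated column.
hide_completed() {
    awk 'NR <= 2 || $5 != "C"'
}

# Example usage (not run here):
#   set_keep_completed 14400      # four hours, an illustrative value
#   qstat | hide_completed
```

This only hides the completed jobs from the listing; the server still keeps them for the keep_completed period.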
The second problem I mentioned is more critical for us. It seems that jobs that are on system hold (because of dependencies) are transferred from the "exec" queue to the "real_exec" queue regardless of the setting of the "route_held_jobs" attribute. Jobs with user holds, on the other hand, stay on the "exec" queue when "route_held_jobs" is set to False, as in our configuration.
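
The routing behaviour described above can be checked with something like the following sketch. The helper name job_queue_and_holds and the script name job.sh are hypothetical; the helper assumes that "qstat -f" prints "queue = ..." and "Hold_Types = ..." lines for the job.

```shell
# Hypothetical helper: read `qstat -f <jobid>` output on stdin and print
# the job's current queue followed by its hold types (n, u, o, s).
job_queue_and_holds() {
    awk -F' = ' '
        /^[ \t]*queue = /      { q = $2 }
        /^[ \t]*Hold_Types = / { h = $2 }
        END                    { print q, h }
    '
}

# Example usage (not run here; job.sh is a placeholder script):
#   qsub -q exec -W depend=afterok:28379059 job.sh   # dependency hold
#   qsub -q exec -h job.sh                           # user hold
#   qstat -f 28379062 | job_queue_and_holds
```

A dependent job that reports "real_exec" as its queue while still held would match the behaviour reported here.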
Eiríkur Hjartarson E-mail: Eirikur.Hjartarson at decode.is
Íslensk Erfðagreining Mobile: +3546641898