[torqueusers] Two problems with a routing queue
Eirikur.Hjartarson at decode.is
Wed Sep 7 02:22:31 MDT 2011
In order to limit the number of jobs that Maui considers for scheduling, we have set up a routing queue:
# Create and define queue exec
create queue exec
set queue exec queue_type = Route
set queue exec route_destinations = real_exec
set queue exec route_held_jobs = False
set queue exec enabled = True
set queue exec started = True
# Create and define queue real_exec
create queue real_exec
set queue real_exec queue_type = Execution
set queue real_exec max_user_queuable = 800
set queue real_exec from_route_only = True
set queue real_exec resources_default.nodes = 1
set queue real_exec enabled = True
set queue real_exec started = True
(800 is a bit higher than the number of CPUs in the cluster)
We have experienced two problems with this setup.
First, suppose a job (id: 28379062) is still on the "exec" queue and depends on another job (id: 28379059). If job 28379059 finishes *before* job 28379062 is moved to the "real_exec" queue, job 28379062 generates the following error mail when it is transferred to "real_exec":
PBS Job Id: 28379062.lpbs2.decode.is
Job Name: bambino_22892
Aborted by PBS Server
Dependency request for job rejected by 28379059.lpbs2.decode.is
Unknown Job Id
Job held for unknown job dep, use 'qrls' to release
Is there any way to solve this problem other than setting the keep_completed attribute to some non-zero value? The problem with keep_completed is that we think we would have to set it to a large value, say one day, so that the finished job is still known to the server when the dependent job is finally routed.
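For reference, the workaround we would like to avoid would look roughly like the qmgr fragment below. This is only a sketch: it assumes keep_completed is set at the server level (it could equally be set on the queue) and that the value is given in seconds, so 86400 corresponds to the one day mentioned above.

```
# Assumption: server-wide setting; value in seconds (86400 = one day)
set server keep_completed = 86400
```

With a value this large, every completed job stays in the server's job list for a full day, which is the cost we are hoping to avoid.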
Second, the "real_exec" queue may fill up with jobs that all depend on a job that is still on the "exec" queue. It seems to me that the route_held_jobs attribute may apply only to user holds. If that is correct, would it be possible to make it apply to system holds as well?