[torqueusers] Two problems with a routing queue
knielson at adaptivecomputing.com
Fri Sep 16 07:07:19 MDT 2011
----- Original Message -----
> From: "Eiríkur Hjartarson" <Eirikur.Hjartarson at decode.is>
> To: torqueusers at supercluster.org
> Sent: Friday, September 16, 2011 3:57:13 AM
> Subject: [torqueusers] Two problems with a routing queue
> I'm resubmitting these questions since I got no replies to them one
> week ago.
> In order to limit the number of jobs that maui considers for
> scheduling, we have a routing queue set up:
> # Create and define queue exec
> create queue exec
> set queue exec queue_type = Route
> set queue exec route_destinations = real_exec
> set queue exec route_held_jobs = False
> set queue exec enabled = True
> set queue exec started = True
> # Create and define queue real_exec
> create queue real_exec
> set queue real_exec queue_type = Execution
> set queue real_exec max_user_queuable = 800
> set queue real_exec from_route_only = True
> set queue real_exec resources_default.nodes = 1
> set queue real_exec enabled = True
> set queue real_exec started = True
> (800 is a bit higher than the number of CPUs in the cluster)
> There are two problems that we have experienced with this setup.
> A job (id: 28379062), that is still on the "exec" queue and depends
> on another job (id: 28379059) that finishes *before* the job (id:
> 28379062) is put on the "real_exec" queue will generate the
> following error mail, when it (id: 28379062) is transferred to the
> "real_exec" queue.
> PBS Job Id: 28379062.lpbs2.decode.is
> Job Name: bambino_22892
> Aborted by PBS Server
> Dependency request for job rejected by 28379059.lpbs2.decode.is
> Unknown Job Id Job held for unknown job dep, use 'qrls' to release
> Is there any way to solve this problem, other than setting the
> keep_completed attribute to some non-zero value? The problem with
> the keep_completed attribute is that we (think we) have to set it to
> a big value, say, one day.
When you set a job dependency, TORQUE needs to know which job it depends on and under what conditions. If there is no record of that job, TORQUE does not know what to do. Did the job finish correctly? Did it fail?
Why can you not submit the second job while the first job is still available?
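
For reference, the keep_completed workaround mentioned in the question would look roughly like the following in qmgr. This is only a sketch: the value is in seconds, and 86400 is an assumed figure for the "one day" mentioned above, not a recommended setting.

```
# Keep completed jobs known to the server for one day (value in seconds; assumed figure)
set server keep_completed = 86400
```

With the record of the finished job retained, the server should still be able to answer the dependency request when the dependent job is routed to real_exec.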