[torqueusers] Two problems with a routing queue
gus at ldeo.columbia.edu
Fri Sep 16 10:28:21 MDT 2011
Ken Nielson wrote:
> ----- Original Message -----
>> From: "Eiríkur Hjartarson" <Eirikur.Hjartarson at decode.is>
>> To: torqueusers at supercluster.org
>> Sent: Friday, September 16, 2011 3:57:13 AM
>> Subject: [torqueusers] Two problems with a routing queue
>> I'm resubmitting these questions since I got no replies to them one
>> week ago.
>> In order to limit the number of jobs that Maui considers for
>> scheduling, we have set up a routing queue:
>> # Create and define queue exec
>> create queue exec
>> set queue exec queue_type = Route
>> set queue exec route_destinations = real_exec
>> set queue exec route_held_jobs = False
>> set queue exec enabled = True
>> set queue exec started = True
>> # Create and define queue real_exec
>> create queue real_exec
>> set queue real_exec queue_type = Execution
>> set queue real_exec max_user_queuable = 800
>> set queue real_exec from_route_only = True
>> set queue real_exec resources_default.nodes = 1
>> set queue real_exec enabled = True
>> set queue real_exec started = True
>> (800 is a bit higher than the number of CPUs in the cluster)
>> There are two problems that we have experienced with this setup.
>> A job (id: 28379062) that is still in the "exec" queue depends on
>> another job (id: 28379059). If 28379059 finishes *before* 28379062
>> is moved to the "real_exec" queue, the following error mail is
>> generated when 28379062 is transferred to the "real_exec" queue.
>> PBS Job Id: 28379062.lpbs2.decode.is
>> Job Name: bambino_22892
>> Aborted by PBS Server
>> Dependency request for job rejected by 28379059.lpbs2.decode.is
>> Unknown Job Id
>> Job held for unknown job dep, use 'qrls' to release
>> Is there any way to solve this problem, other than setting the
>> keep_completed attribute to some non-zero value? The problem with
>> the keep_completed attribute is that we (think we) have to set it to
>> a big value, say, one day.
> When you set a job dependency, TORQUE needs to know which job and under what conditions.
> If there is no record of a job, TORQUE does not know what to do.
> Did the job finish correctly? Did it fail?
> Why can you not submit the second job while the first job is still available?
> Ken Nielson
> Adaptive Computing
Hi Eirikur and Ken
I had a similar problem some time ago,
and I found it useful to extend the time that completed jobs are kept on the queue.
Note that the unit used is seconds.
If you don't have a high volume of jobs this is not a problem:
qmgr -c 'set server keep_completed = <the number of seconds you want>'
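
For example, to keep completed jobs for the one day mentioned above, the value works out as below. This is only a sketch: the 24-hour window and the variable names are my own assumptions, and the script just prints the command for review instead of running it.

```shell
# Sketch: compute a retention window in seconds and print the qmgr
# command for review (dry run; nothing is changed on the server).
# The 24-hour value is only an example, not a recommendation.
retention_hours=24
retention_seconds=$((retention_hours * 60 * 60))

# keep_completed takes its value in seconds.
keep_cmd="qmgr -c 'set server keep_completed = ${retention_seconds}'"
echo "$keep_cmd"
```

Running the printed command as a TORQUE manager should apply the setting; qmgr -c 'print server' can be used afterwards to check the value.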
Also, a question, perhaps for Ken as well:
What makes 'afterok' true?
Is it an empty stderr?
Oftentimes programs dump warning messages [not errors] to stderr,
so the job ends 'OK' but stderr is not empty.
Because of this doubt, I prefer to use 'afterany'.
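
A small sketch of the distinction in question, assuming (as the "terminated with no errors" wording in the TORQUE documentation suggests, but not confirmed in this thread) that 'afterok' is decided by the exit status of the job and not by what it wrote to stderr. The function noisy_job and the script name second_job.sh are hypothetical stand-ins; the job id is the one from the error mail above. The qsub lines are only printed, not executed.

```shell
# Hypothetical stand-in for a job script that warns on stderr
# but still finishes successfully (exit status 0).
noisy_job() {
    echo "warning: using default settings" >&2
    return 0
}

# Capture only stderr; the exit status of the function is kept in $?.
stderr_text=$(noisy_job 2>&1 >/dev/null)
job_status=$?

echo "exit status: ${job_status}"
echo "stderr: ${stderr_text}"

# The two dependency types under discussion (printed, not submitted).
first_id="28379059.lpbs2.decode.is"
echo "qsub -W depend=afterok:${first_id} second_job.sh"
echo "qsub -W depend=afterany:${first_id} second_job.sh"
```

If that assumption holds, a job like this would satisfy 'afterok' even though its stderr is not empty, and 'afterany' would only be needed when the second job should run regardless of how the first one ended.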