[gold-users] speeding up gold reservations

Stijn De Weirdt stijn.deweirdt at ugent.be
Thu Apr 16 05:45:18 MDT 2009


> Hi Stijn,
> 
> Sorry to take so long to reply .... you know the story.
> 
no problem. glad you never delete your spam ;)

> > i'm running a maui 3.2.6p21/torque 2.3.6 with GOLD as AM and we have
> > some issues with users submitting large (as in 10+k) amounts of short
> > (5-10 minutes) jobs to the queue and this has been choking up the system
> > somewhat.
> >
> >   
> Wow. That's a lot :)
the scary part is that users think it isn't ;)

> 
> > one factor in this whole process is that gold slows things down a lot. i
> > see gold reservation requests (Successfully reserved X credits for job
> > Y) when each job enter the maui queue (if i phrase this correctly), and
> > one when the job is done.
> > the first request is the most limiting one, as all new jobs in the queue
> > are processed on entering (although i have MAXIJOB set rather low, so
> > almost all of these jobs enter as blocked jobs anyway). each request
> > takes approx 600-700ms, and because the jobs finish more quickly than
> > the time needed for maui to add newly submitted jobs (not all of them,
> > but still a lot), cluster usage is spiked. 
> >
> >   
> I would expect the gold calls to happen when the jobs are started, not 
> when they are submitted. Some sites use a submit filter to check for a 
> reasonable balance when the job is submitted to prevent it being held if 
> it is later found to be out of credits, but this is entirely optional.
> 
what i see from the logs (when using torque/maui) is that after you
submit a job to torque, maui polls torque for yet unknown jobs. when
these jobs are seen by maui (and enter what i call the maui queue), they
are checked with gold if enough credits exist (irrespective of whether the
job is considered blocked or not).
when maui starts the job, it checks gold again (the real reservation i
assume) and then once more at the end of the job. 

> > maui is annoying that it doesn't make these request in parallel or
> > doesn't make them at all since these will be blocked jobs anyway. if the
> > gold requests were made when unblocking the jobs, at least the usage
> > would be more optimal.
> >
> >   
> I am surprised by this behavior. I am assuming you are meaning that the 
> jobs run shortly after being unblocked and that you would prefer Maui 
> reserve the jobs in Gold at this time instead of at submit time. I too 
> believe that is what it should be doing.
i could check it in the code, but guess that maui first checks gold and
only then determines if the job should be blocked or not.
(the most annoying situation is when you restart maui, it rediscovers
all jobs in torque (easily taking up 1-2 hours of maui only interacting
with gold, so no scheduling activity) and people submitting lots of
small jobs that could already start)

do you know if moab does this too? (ie verifying gold for jobs that
enter but will be blocked)

>  Perhaps you can present the 
> evidence of this behavior in the maui logs. Or perhaps I am 
> misunderstanding your statement. 
should be very easy to do. i'll collect the necessary logs when i have
some more time.

> It is true that neither Maui nor Moab 
> currently batch the gold requests (probably primarily due to the fact 
> that there is no current support in Gold for batched requests). [That 
> might not be entirely true -- Gold may support it if you were to use the 
> perl API, I'd have to look.]
batch requests should work for jobs that were submitted by the same user.
processing all jobs in batches might result in a speedup, but grouping
requests from the same user (and maybe even the same walltime/number of
nodes or other parameters) should do it (it's not the connection itself
to gold that slows things down)
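
a rough sketch of the grouping idea, purely illustrative: neither gold nor
maui offers this today, and the Job fields and the group_for_batching name
are made up for the example. jobs that share user, walltime and node count
end up in one group, so in principle one gold request could cover them all.

```python
from collections import defaultdict
from typing import NamedTuple


class Job(NamedTuple):
    # hypothetical job record; field names are assumptions, not maui/gold names
    job_id: str
    user: str
    walltime_s: int
    nodes: int


def group_for_batching(jobs):
    """Group job ids by (user, walltime, nodes), keeping submission order."""
    groups = defaultdict(list)
    for job in jobs:
        groups[(job.user, job.walltime_s, job.nodes)].append(job.job_id)
    return dict(groups)


jobs = [
    Job("101", "alice", 600, 1),
    Job("102", "alice", 600, 1),
    Job("103", "bob", 300, 2),
    Job("104", "alice", 600, 1),
]
batches = group_for_batching(jobs)
for key, job_ids in batches.items():
    print(key, job_ids)
```

with 10+k near-identical short jobs from one user, the number of groups
(and so of potential gold requests) would be far smaller than the number
of jobs.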

> 
> > but does anyone have any tips to speedup individual gold queries?
> >
> >   
> Yes, Are you already using the new indexes? We've recently introduced 
> indexes into the Gold tables which roughly speeds things up by 10x. 
> Also, if your database has been in use for awhile (weeks or months), you 
> will need to VACUUM it periodically to keep the queries quick (this also 
> can make a very large difference).
we are using 2.1.7.1, i assume that these indexes are in there.

> > i have a tip myself: there is a certain SQL query (see bottom of mail)
> > that is executed rather slowly with the default schema (it's not cached
> > by the DB unlike almost all other SELECT SQL queries from gold).
> > (it is actually the slowest of them all, taking approx 400-500ms of the
> > total 600-700ms).
> > we first had MySQL as DB, but we switched to postgres 8.3.6 (this
> > gave 10-15% speedup), and i found that adding another 2 partial indexes
> > improved this query to approx 150-200ms.
> >
> > CREATE INDEX g_reservation_not_deleted_start_idx ON g_reservation
> > (g_start_time) WHERE g_deleted!='True';
> > CREATE INDEX g_resallo_not_deleted_id_idx ON g_reservation_allocation
> > (g_id) WHERE g_deleted!='True';
> >
> > ANALYZE g_reservation;
> > ANALYZE g_reservation_allocation;
> >
> >   
> 
> Let me know if you can pinpoint anything else that can be improved and 
> we can either address it or put it in as a feature request.
the current situation is as follows: we can process a typical request in
300-500ms. this gives 2-3 requests per second (but with the 3 steps
described above that means 1-1.5 seconds of gold processing time per
completed job, ie max 3600 jobs/hour, which is low).
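
the arithmetic behind that ceiling, as a small sketch (the function name is
made up; it assumes gold requests are handled strictly one after another,
as described above):

```python
def max_jobs_per_hour(gold_seconds_per_job: float) -> float:
    """Upper bound on completed jobs/hour when gold handles requests serially."""
    return 3600.0 / gold_seconds_per_job


# 1-1.5 seconds of gold processing per completed job (3 requests each)
for seconds in (1.0, 1.5):
    print(f"{seconds} s/job -> {max_jobs_per_hour(seconds):.0f} jobs/hour")
```

so the 3600 jobs/hour figure is the best case of that range, and the slow
end comes out at 2400 jobs/hour.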

when i start gold with loglevel TRACE and i check the time spent, half
of it is in DB access with the longest query taking at max 100ms (all
others seem cached). 
this also means that the other half is spent in running "perl". 

speeding up the DB (more indexes, further tuning) seems unlikely to
help. for now the only thing that can help i think is faster hardware. 

we'll see how it goes.

stijn

> 
> Thanks,
> 
> Scott
> > hope this helps,
> >
> > stijn
> >
> > SQL Query:
> > SELECT g_reservation_allocation.g_id, g_reservation_allocation.g_amount
> > FROM g_reservation, g_reservation_allocation
> > WHERE ( g_reservation.g_id=g_reservation_allocation.g_reservation
> >   AND g_reservation.g_start_time<='1236070726'
> >   AND g_reservation.g_end_time>'1236070726'
> >   AND ( g_reservation_allocation.g_id='35'
> >      OR g_reservation_allocation.g_id='20' ) )
> >   AND g_reservation.g_deleted!='True'
> >   AND g_reservation_allocation.g_deleted!='True'
> >
> 
-- 
The system will shutdown in 5 minutes.


