[gold-users] moab-gold binding performance issue
Scott Jackson
scottmo at clusterresources.com
Thu Jun 4 13:46:40 MDT 2009
Hu, Zongjun wrote:
>
> Hi,
>
> We use Gold-2.1.6 as allocation manager for Moab-5.2.0. We are trying
> to move this configuration to production. We got several issues and
> need help.
>
> 1. This binding is working as job reservation @ job start. Most of the
> time, this mode works perfectly. If a job is accepted to start, a new
> job will be created in gold and account balance will be deducted
> according to requested cpu hours. If this job runs and finishes without
> problem, this job will be charged finally according to real usage. The
> previous charge in gold will be returned to account balance. However,
> we got an interesting problem. We found some jobs are accepted to
> start, and then moved to compute nodes. For some reason, these jobs do
> not start successfully on compute nodes. They are then rejected and
> moved back to blocked/waiting list. After a while, these jobs will be
> evaluated again and repeat these steps. In this situation, multiple
> jobs will be created in gold and account balance will be deducted
> multiple times (to make things worse, users usually request much more
> than they need). When these jobs finally finish or are canceled, gold
> will only change the last job created to the 'Charge' stage and return
> only the last deduction to the account. All previous jobs/deductions
> created will stay in gold and the reserved balance won't be restored.
> After a while, gold will have a huge amount of 'Reserved balance' even
> though all jobs are completed. Can you give us instructions to fix this
> problem and release all of the unnecessarily reserved balance?
>
I believe that Moab should be releasing these Gold reservations when it
discovers that the jobs have been rejected (failed to start
successfully). That would be the correct fix. As it is, the reservations
only remain active within Gold for the wallclock duration of the job;
then they automatically become inactive and no longer affect the
balance. Any time you need to, you can remove a reservation within Gold
with the grmres command. Also, grmres -I can be used to get rid of all of
the stale reservations that are no longer affecting the balance. At any
given time, if you run glsres -A, the list of reservations returned
should pertain only to currently running jobs.

Can you tell me what version of Moab you are running? I ask because I am
surprised that the multiple reservations are creating multiple jobs
within Gold. I believe it was quite a long time ago that Moab should have
started using a new Replace=True option in the reservations to avoid
creating new job instances in Gold. Also, please tell me the Gold
version.

As for the reservations not being removed, I would say this is a Moab bug
and would recommend you submit a ticket to
moab-support at clusterresources.com explaining the problem and providing
what evidence you can collect of it (run support.diag.pl and send the
resulting tarball along with goldd.log, any pertinent torque logs, etc.).
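
To make the accounting above concrete, here is a minimal Python sketch of
a toy ledger. It is not Gold's implementation or API: the Account and
Reservation classes, their method names, and all amounts are hypothetical
and only model the behaviour described in this thread (a reservation
holds balance for the requested wallclock duration, and a charge removes
only the most recent reservation for the job).

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    job_id: str
    amount: float    # credits held by this reservation
    start: float     # time the reservation was made, in seconds
    duration: float  # requested wallclock, in seconds

    def active(self, now: float) -> bool:
        # A reservation only holds balance during its wallclock window.
        return self.start <= now < self.start + self.duration


@dataclass
class Account:
    balance: float
    reservations: list = field(default_factory=list)

    def reserve(self, job_id, amount, now, duration):
        # Every start attempt adds a new hold (no replace semantics here).
        self.reservations.append(Reservation(job_id, amount, now, duration))

    def available(self, now):
        held = sum(r.amount for r in self.reservations if r.active(now))
        return self.balance - held

    def charge(self, job_id, amount):
        # Toy version of the reported behaviour: only the most recent
        # reservation for the job is released when the job is charged.
        for i in range(len(self.reservations) - 1, -1, -1):
            if self.reservations[i].job_id == job_id:
                del self.reservations[i]
                break
        self.balance -= amount

    def purge_inactive(self, now):
        # Drop holds that no longer affect the balance; return the count.
        before = len(self.reservations)
        self.reservations = [r for r in self.reservations if r.active(now)]
        return before - len(self.reservations)


if __name__ == "__main__":
    acct = Account(balance=1000.0)
    for t in (0, 60, 120):            # three start attempts for one job
        acct.reserve("job.1", 100.0, now=t, duration=3600)
    acct.charge("job.1", 5.0)         # job ends, real usage is small
    print(acct.available(now=200))    # two stale holds still count
    print(acct.available(now=3700))   # stale holds have expired
    print(acct.purge_inactive(now=3700))
```

In this model the two leftover holds reduce the available balance until
their wallclock window passes, after which they are inert and can be
purged, which is the role the text above gives to grmres -I.
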
> 2. Sometimes we have a lot of small jobs (finishing in 1 minute). For
> each job, Moab has to contact the gold server to reserve and then
> charge the job when it finishes. These small jobs make moab repeat
> these steps frequently, and the moab server gets very busy. This slows
> down responses to user requests a lot, and sometimes user requests
> time out. Is there a way to speed up gold job processing? If not, can
> we configure the Moab-gold binding to charge the job @ job end time
> only? That would save at least half of the processing time. We did not
> find guidance in the Moab or Gold documents for this configuration.
>
I don't know whether that is configurable within Moab. (I think it
probably should be, but I doubt that it is.) I believe that CRI would
accept this change request and provide a Moab parameter to skip the
reservation. You would have to submit a ticket to
moab-support at clusterresources.com to see if Moab can accommodate this
change request. It would also be possible to take it out of Moab's hands
entirely, if you wanted to, by writing your own epilog to call gcharge
(and removing the applicable AMCFG parameters from moab.cfg).
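
As a rough illustration of the epilog idea, the Python sketch below only
builds the argument list for a gcharge call and prints it; it does not
run anything. The flags (-J job id, -u user, -p project, -m machine,
-P processors, -t charge duration in seconds) are assumptions that should
be checked against the gcharge man page for your Gold version, and the
function name and sample values are hypothetical. How a real epilog
obtains these values from the resource manager is site-specific and not
shown.

```python
import shlex


def build_gcharge_argv(job_id, user, project, machine, processors,
                       wall_seconds):
    """Return an argv list for gcharge; nothing is executed here."""
    return [
        "gcharge",
        "-J", str(job_id),        # resource manager job id (assumed flag)
        "-u", user,               # user to charge (assumed flag)
        "-p", project,            # project/account (assumed flag)
        "-m", machine,            # machine/cluster name (assumed flag)
        "-P", str(processors),    # processors used (assumed flag)
        "-t", str(wall_seconds),  # actual wallclock in seconds (assumed)
    ]


if __name__ == "__main__":
    # Illustrative values only.
    argv = build_gcharge_argv("PBS.1234.0", "amy", "chemistry",
                              "colony", 2, 1234)
    print(shlex.join(argv))
```

Building the command as a list, rather than a single string, avoids
shell quoting problems if the epilog later passes it to subprocess.run.
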
How much time are your reservations and charges taking? If they take
longer than a fraction of a second (.2 or .3 sec), then I would ask
whether you are VACUUMing your database frequently and whether you have
the indexes set up in your database. Please let me know about this.
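
One way to answer the timing question is to measure a few calls and
compare the median against the rough threshold mentioned above. The
Python sketch below is a generic helper, not part of Gold or Moab: the
function names are hypothetical, the 0.3 second default is taken from the
".2 or .3 sec" figure above, and the operation being timed is whatever
callable you pass in (for example a wrapper around one reservation or
charge request at your site).

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times and return the wall-clock samples."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return samples


def is_slow(samples, threshold=0.3):
    """True if the median sample exceeds the threshold (in seconds)."""
    return statistics.median(samples) > threshold


if __name__ == "__main__":
    # Stand-in workload; replace with a call into your own setup.
    samples = time_calls(lambda: time.sleep(0.01), repeats=5)
    print("median seconds:", statistics.median(samples))
    print("slower than threshold:", is_slow(samples))
```

The median is used instead of the mean so that a single slow outlier
(for example, the first connection) does not dominate the result.
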
I hope this helps,
Scott
> Thanks.
>
> Zongjun Hu
> -----
> University of Miami
> Center for Computational Science
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> gold-users mailing list
> gold-users at supercluster.org
> http://www.supercluster.org/mailman/listinfo/gold-users
>