[Mauiusers] problems with many job submissions

Si Hammond simon.hammond at gmail.com
Thu May 14 12:24:55 MDT 2009


Hi Roy/Steve,

Could you get the user to put multiple executions inside one submit
script/job, so that the same executions get done in coarser-grained
blocks rather than as individual jobs?
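
A minimal sketch of what such a submit script might look like, assuming
the user's commands can be listed one per line in a plain text file. The
function name run_batch, the TASKFILE variable and the resource requests
are made up for illustration and would need adjusting for the site; only
PBS_O_WORKDIR and the #PBS directives come from torque itself.

```shell
#!/bin/sh
#PBS -N batched_runs
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00

# run_batch TASKFILE: run every command line in TASKFILE, one after
# another, inside this single job. Blank lines and lines starting with
# '#' are skipped. Prints a summary and returns non-zero if any command
# failed. (Hypothetical helper, not part of torque/maui.)
run_batch() {
    taskfile=$1
    total=0
    failed=0
    while IFS= read -r cmd || [ -n "$cmd" ]; do
        case $cmd in
            ''|'#'*) continue ;;
        esac
        total=$((total + 1))
        # stdin is redirected so a command cannot swallow the task list
        sh -c "$cmd" < /dev/null || failed=$((failed + 1))
    done < "$taskfile"
    echo "ran $total commands, $failed failed"
    [ "$failed" -eq 0 ]
}

# TASKFILE is an assumed variable, passed in at submit time.
if [ -n "${TASKFILE:-}" ]; then
    cd "${PBS_O_WORKDIR:-.}" || exit 1
    run_batch "$TASKFILE"
fi
```

With the 10,000 commands split into, say, 100 task files of 100 lines
each, the user would submit 100 jobs along the lines of
"qsub -v TASKFILE=tasks_001.txt batch.sh" (file names hypothetical), so
maui and gold see 100 start/stop events instead of 10,000.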



Si Hammond

High Performance Systems Group
University of Warwick, UK

On 14 May 2009, at 19:11, Steve Young wrote:

> Thanks Roy,
> Actually, I'm doing something similar: max_queuable = 120 for the
> execution queue, and everyone gets a MAXIJOB of 4. Like you mention,
> I'd rather not worry about small jobs either; traditionally we've had
> jobs that run for hours or weeks, and that works perfectly. Just this
> one user (who refuses to look at MPI since they've been told it's too
> hard to learn) seems to come up with strange ways of running their
> homegrown code. We're using mysql for a db server. I expect that
> anything I can tweak would be here, similar to what you point out for
> postgres. Does anyone have pointers for mysql databases and gold? I
> hope at the least I will be able to talk them into running multiple
> commands within one job, so as to save on the overhead of
> starting/stopping the job on a node. Thanks again for the advice.
>
> -Steve
>
> On May 14, 2009, at 1:23 PM, Roy Dragseth wrote:
>
>> On Thursday 14 May 2009 17:45:22 Steve Young wrote:
>>> Hi all,
>>> I have been experiencing a problem with a user submitting thousands
>>> of jobs. Most of the jobs either finish in a matter of seconds or
>>> aren't even doing anything. I'm using torque, maui and gold. I'm
>>> using a routing queue to hold the 10,000 jobs they submit (all
>>> single cpu jobs). The routing queue works fine and routes to the
>>> proper execution queue (able to run 116 at a time). However, I
>>> notice that as the system chews through the jobs, they drop off so
>>> fast that the system has a hard time keeping up. The mysql server
>>> goes to 100% and there is even a load on goldd. I suspect the jobs
>>> are starting/stopping so fast that creating the reservations and
>>> other record-keeping in maui/gold is causing this load.
>>>
>>> I'm hoping to get the user to make some changes to how they submit
>>> jobs (but they can be difficult at times). I suspect that if the
>>> jobs ran for even 5 minutes or so, the system could at least keep
>>> up. So I'm curious to know whether others have run into this type
>>> of problem and what you did to solve it. Are there some changes in
>>> torque/maui/gold that I could make to help alleviate this?
>>>
>>
>> I posted the exact same question on the gold list last year (but the
>> archive at pnl.gov is gone and I could not find the thread in the
>> clusterresources archive).
>>
>> If you do not want to write your own layer between maui and gold,
>> you're pretty much stuck.
>>
>> We ended up limiting the number of idle and running jobs per user.
>> By default each user is limited to 200 running jobs and 16 idle jobs:
>>
>> USERCFG[DEFAULT] MAXJOB=200 MAXIJOB=16
>>
>> Our policy is not to optimize the batch system for lots of small
>> jobs. By setting the above limits we sort of encourage our users to
>> adjust their work setup. Even if you bring down the response time
>> from accounting, the scaling will be limited; Amdahl's law will kick
>> in eventually...
>>
>> If you're using postgres as the backend for gold you should vacuum
>> the database regularly. The gold user has this in its crontab:
>>
>> # su - gold
>> -bash-3.00$ crontab -l
>> 00 04 * * * sh /opt/gold/vacuum.sh
>> -bash-3.00$ cat /opt/gold/vacuum.sh
>> #!/bin/sh
>>
>> # vacuum the database, makes it run faster.
>>
>> /usr/bin/psql -c "vacuum; vacuum analyze;"
>>
>>
>> Doing this brought the accounting response time down from 6 seconds
>> to 1 second (our db server is really slow...).
>>
>> r.
>>
>>
>>
>> -- 
>> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
>>             phone:+47 77 64 41 07, fax:+47 77 64 41 00
>>    Roy Dragseth, Team Leader, High Performance Computing
>>        Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
>> _______________________________________________
>> mauiusers mailing list
>> mauiusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>

Si Hammond

Performance Modelling, Analysis and Optimisation Team
High Performance Systems Group
Department of Computer Science
University of Warwick, CV4 7AL, UK






