[Mauiusers] problems with many job submissions

Steve Young chemadm at hamilton.edu
Thu May 14 12:44:21 MDT 2009


Yeah, that's what I'm hoping to talk to them about and get them to do.
I was just curious to hear what others might have done if they
encountered this situation, and whether there was any tuning on the
system that could help. Thanks,

-Steve



On May 14, 2009, at 2:24 PM, Si Hammond wrote:

> Hi Roy/Steve,
>
> Could you get the user to put multiple executions inside one submit
> script/job, so that they get the same work done in coarser-grained
> blocks rather than as individual jobs?
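A coarse-grained submit script along these lines is one way to sketch Si's suggestion. The job name, resource request, and the placeholder command are all hypothetical; the user's homegrown binary and its parameter list would go where the echo is.

```shell
#!/bin/sh
#PBS -N batched_runs
#PBS -l nodes=1:ppn=1,walltime=02:00:00

# Run many short executions serially inside a single Torque job, so
# the scheduler and accounting overhead is paid once per batch rather
# than once per execution.  The echo stands in for the real command,
# e.g.  ./mycode "$param"
for param in 1 2 3 4 5 6 7 8 9 10
do
    echo "running case $param"
done
```

With 10,000 cases, batching even 100 per job would cut the job count (and the Maui/Gold record-keeping) by two orders of magnitude.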
>
>
>
> Si Hammond
>
> High Performance Systems Group
> University of Warwick, UK
>
> On 14 May 2009, at 19:11, Steve Young wrote:
>
>> Thanks Roy,
>> 	Actually, I'm doing something similar... max_queuable = 120 for the
>> execution queue and everyone gets a MAXIJOB of 4. Like you mention,
>> I'd rather not worry about small jobs either; traditionally we've had
>> jobs that run for hours or weeks, and that works perfectly. Just this
>> one user (who refuses to look at MPI since they've been told it's too
>> hard to learn) keeps coming up with strange ways of running their
>> homegrown code. We're using MySQL for the db server, so I expect
>> anything I can tweak would be there, similar to what you point out
>> for Postgres. Any pointers people have for MySQL databases and Gold?
>> I hope at the least I'll be able to talk them into running multiple
>> commands within one job, so as to save the overhead of starting and
>> stopping the job on a node. Thanks again for the advice.
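For the MySQL side, a rough analogue of Roy's nightly Postgres vacuum would be periodic table maintenance with mysqlcheck. This is only a sketch: the database name "gold" is an assumption, and credentials would need to come from a ~/.my.cnf or similar.

```shell
#!/bin/sh
# Nightly maintenance for a MySQL-backed Gold database (assumed to be
# named "gold"): ANALYZE refreshes index statistics, OPTIMIZE
# defragments the tables.  Run from the gold user's crontab, e.g.
#   00 04 * * * sh /opt/gold/mysql-maint.sh
mysqlcheck --analyze  --databases gold
mysqlcheck --optimize --databases gold
```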
>>
>> -Steve
>>
>>
>>
>>
>>
>> On May 14, 2009, at 1:23 PM, Roy Dragseth wrote:
>>
>>> On Thursday 14 May 2009 17:45:22 Steve Young wrote:
>>>> Hi all,
>>>> 	I have been experiencing a problem with a user submitting
>>>> thousands of jobs. Most of them either finish in a matter of
>>>> seconds or aren't doing anything at all. I'm using Torque, Maui,
>>>> and Gold, with a routing queue to contain the 10,000 jobs they
>>>> submit (all single-cpu jobs). The routing queue works fine and
>>>> routes to the proper execution queue (able to run 116 at a time).
>>>> 	However, I notice that as the system chews through the jobs, they
>>>> drop off so fast that it has a hard time keeping up. The MySQL
>>>> server goes to 100%, and goldd shows heavy load as well. I suspect
>>>> the flurry of jobs starting and stopping so quickly means the
>>>> reservations and other record-keeping in Maui/Gold are generating
>>>> this load.
>>>> 	I'm hoping to get the user to make some changes to how they
>>>> submit jobs (but they can be difficult at times). I suspect that if
>>>> the jobs ran for even 5 minutes or so, the system could keep up. So
>>>> I'm curious whether others have run into this type of problem and
>>>> what you did to solve it. Are there some changes in Torque/Maui/
>>>> Gold that I could make to help alleviate this?
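For readers less familiar with the setup being described, a routing queue feeding a capped execution queue is typically defined in Torque's qmgr along these lines. Queue names and limits here are illustrative, not Steve's actual configuration.

```shell
# qmgr fragment: a routing queue that forwards submissions to an
# execution queue with concurrency and depth limits.
create queue route_q queue_type=route
set queue route_q route_destinations = exec_q
set queue route_q enabled = true
set queue route_q started = true

create queue exec_q queue_type=execution
set queue exec_q max_running = 116
set queue exec_q max_queuable = 120
set queue exec_q enabled = true
set queue exec_q started = true
```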
>>>>
>>>
>>> I posted the exact same question on the gold list last year (but
>>> the archive at pnl.gov is gone and I could not find the thread on
>>> the clusterresources archive).
>>>
>>> If you do not want to write your own layer between Maui and Gold,
>>> you're pretty much stuck.
>>>
>>> We ended up limiting the number of idle and running jobs per user.
>>> By default each user is limited to 200 running jobs and 16 idle
>>> jobs:
>>>
>>> USERCFG[DEFAULT] MAXJOB=200 MAXIJOB=16
>>>
>>> Our policy is not to optimize the batch system for lots of small
>>> jobs.  By setting the above limits we sort of encourage our users
>>> to adjust their work setup.  Even if you bring the response time
>>> from accounting down, the scaling will be limited; Amdahl's law
>>> will kick in eventually...
>>>
>>> If you're using postgres as the backend for gold you should vacuum  
>>> the
>>> database regularly.  The gold user has this in crontab
>>>
>>> # su - gold
>>> -bash-3.00$ crontab -l
>>> 00 04 * * * sh /opt/gold/vacuum.sh
>>> -bash-3.00$ cat /opt/gold/vacuum.sh
>>> #!/bin/sh
>>>
>>> # vacuum the database, makes it run faster.
>>>
>>> /usr/bin/psql -c "vacuum; vacuum analyze;"
>>>
>>>
>>> Doing this brought the accounting response time down from 6 seconds
>>> to 1.
>>> (our db server is really slow...)
>>>
>>> r.
>>>
>>>
>>>
>>> -- 
>>> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
>>>            phone:+47 77 64 41 07, fax:+47 77 64 41 00
>>>   Roy Dragseth, Team Leader, High Performance Computing
>>>       Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
>>> _______________________________________________
>>> mauiusers mailing list
>>> mauiusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>
>
> Si Hammond
>
> Performance Modelling, Analysis and Optimisation Team
> High Performance Systems Group
> Department of Computer Science
> University of Warwick, CV4 7AL, UK
>
>
>
>
>


