[torqueusers] Short of physical memory, crash?

Christoph (Stucki) von Stuckrad stucki at mi.fu-berlin.de
Fri Dec 21 06:01:58 MST 2012


On Fri, 21 Dec 2012, Diego Bacchin wrote:

> In my experience the node will start use the swap partition. the jobs
> will work if you have enough swap but the performance will be very
> very slow.

This is correct if the memory limit is not enforced.
I think there are three typical cases:

Assume a large number of jobs that are 'supposed to be <4G',
with the limit SET to 4G via qsub (or job scripts, or defaults).

Torque/Maui will make sure to put no more jobs onto a machine
than there is memory for. (In my test case only three on a 32G
host! Seemingly it counts the memory that is really free, so the
system itself uses too much to allow the fourth job in my case.)
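
As a rough illustration of that packing rule (a sketch only, not
how Torque/Maui is actually implemented): if the scheduler counts
the memory that is really free instead of the nominal RAM size,
the number of jobs that fit is the free memory divided by the
per-job limit, rounded down. The function name and the numbers
are made up for the example and are not taken from my host.

```python
def jobs_that_fit(free_mem_mb: int, per_job_limit_mb: int) -> int:
    """Number of jobs with the given memory limit that fit into free memory."""
    if per_job_limit_mb <= 0:
        raise ValueError("per-job limit must be positive")
    # Negative free memory is treated as zero; integer division rounds down.
    return max(free_mem_mb, 0) // per_job_limit_mb


# Hypothetical host: 16384 MB nominal RAM, but only 15000 MB really free.
nominal_slots = jobs_that_fit(16384, 4096)
actual_slots = jobs_that_fit(15000, 4096)
print(nominal_slots, actual_slots)
```

So a fairly small amount of memory used by the system itself can
already cost a whole job slot.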

Now, what happens if a job starts to allocate more than that
will depend on the settings of torque (I believe).

1) In our case here, the job will be killed shortly after
it starts to grow.
To survive the time between the growth and the kill, the
system uses swap space if necessary, and will definitely
slow down to a crawl as long as it is swapping in AND out,
but only for a relatively short time.

I have never tried it, but from reading the manuals I think one
can configure alternatives to killing the job, so you might
2) simply let them run (but slowly), if you're sure
they each 'overbook' the memory only for a short time,
and NOT ALL AT ONCE. If memory AND swap are BOTH
exhausted, the kernel will randomly kill 'programs which
request more memory' and the system will be unstable or
die horribly (or e.g. only torque_mom dies first).

3) There also seem to be settings to allow SLIGHT overbooking
for the sum of all jobs, to 'fill' the host completely.
(In my case I'd have to allow nearly 2G to squeeze in another
4G job, which would result in nearly 2G being swapped out,
but there's a good chance those 2G are not really needed
all the time, so this might NOT slow down overall use.)
I have not found out whether such a 'soft limit' versus
'hard/kill limit' solution exists for the jobs themselves.
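
To make the arithmetic of case 3 concrete, here is a small sketch
of how much overbooking would be needed to squeeze in one more
job. It is only an illustration: the function name is made up,
and the 14336 MB of free memory is an assumed figure, chosen so
that the result matches the 'nearly 2G' mentioned above. It is
not a measured value and not a torque or maui setting.

```python
def overbook_needed_mb(free_mem_mb: int, per_job_limit_mb: int,
                       extra_jobs: int = 1) -> int:
    """Memory (in MB) by which the sum of all job limits would exceed the
    free memory if `extra_jobs` more jobs were added beyond what fits."""
    if per_job_limit_mb <= 0:
        raise ValueError("per-job limit must be positive")
    fitting = free_mem_mb // per_job_limit_mb          # jobs that fit today
    wanted_mb = (fitting + extra_jobs) * per_job_limit_mb
    return wanted_mb - free_mem_mb


# Assumed figures: 4096 MB limit per job, 14336 MB really free,
# so three jobs fit and a fourth needs 2048 MB of overbooking.
print(overbook_needed_mb(14336, 4096))
```

Whether that much overbooking is harmless still depends on how
much of their limit the jobs really use at the same time.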

And maybe somebody will point me (us?) in the right
direction on how to calculate/install the correct settings
for such 'overbooking of memory to fully use the nodes'?

Yours  Stucki (starting cluster admin)

-- 
Christoph von Stuckrad      * * |nickname |Mail <stucki at mi.fu-berlin.de> \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(Mo.,Mi.):+49 30 838-75 459|
Mathematik & Informatik EDV |\ *|if online|  (Di,Do,Fr):+49 30 77 39 6600|
Takustr. 9 / 14195 Berlin   * * |on IRCnet|Fax(home):   +49 30 77 39 6601/

