[torqueusers] vmem and pvmem

Martin Siegert siegert at sfu.ca
Mon Mar 5 21:00:35 MST 2012


On Sat, Feb 25, 2012 at 10:50:00AM +1100, David Singleton wrote:
> On 02/25/2012 09:00 AM, Martin Siegert wrote:
> > On Fri, Feb 24, 2012 at 11:19:37AM +0100, "Mgr. Šimon Tóth" wrote:
> >>> Core_req       vmem  pvmem ulimit-v RPT
> >>> =========================================
> >>> nodes=1:ppn=2  1gb   256mb 256mb    512mb
> >>> procs=2        1gb   256mb 256mb    1gb
> >>> nodes=1:ppn=2  1gb   4gb   1gb      4gb
> >>> procs=2        1gb   4gb   1gb      4gb
> >>> nodes=1:ppn=2  1gb   -     1gb      512mb
> >>> procs=2        1gb   -     1gb      1gb
> >>>
> >>> So the ulimit value that influences whether a task can allocate
> >>> memory, is set as the lower of the vmem and pvmem values. That
> >>> makes some sense - at least more sense than taking the larger
> >>> value.  What doesn't make sense is allowing pvmem to be higher
> >>> than vmem in the first place - in that case torque should probably
> >>> reject the job or 'fix' one of the settings but leaving it as is
> >>> might not be so bad, except for moab's behaviour (keep reading).
> >>
> >> No. The logic is as follows:
> >>
> >> * if pvmem (or pmem) is set
> >>     then set the corresponding ulimit to pvmem (pmem) value
> >>
> >> * if pvmem (or pmem) isn't set
> >>     then set the corresponding ulimit to vmem (mem) value
> >>
> >> Note that using pvmem is mostly pointless. On Linux this represents
> >> address space, not virtual memory.
> >>
> >> You can use vmem as virtual memory, but even that is extremely confusing.
> >
> > I do not understand this comment. Both pvmem and vmem requests will
> > result in RLIMIT_AS getting set.
> 
> I disagree with vmem setting RLIMIT_AS if that is what is happening.
> 
> > When I submit a MPI job using, e.g., procs=N, why is requesting
> > pvmem=X mostly pointless? Shouldn't it be totally equivalent to
> > requesting vmem=X*N ?
> >
> 
> I think we have had the discussion of what procs means on a number of
> occasions (look for the thread "processes vs processors").  I believe "procs"
> (now) means (virtual) processORs (most commonly, they are cores).  They are not
> processes.  [In OpenPBS they were processes and only the UNICOS MOM supported
> that limit.  At least in torque-3.0.2 procs is still not properly documented
> in pbs_resources* man pages.]
> 
> pvmem sets some sort of memory limit per *process* so vmem should have nothing
> to do with procs and pvmem.  pvmem and vmem are pretty much orthogonal. One is
> a voluntary limit the user places on their job processes (useless for actual
> resource scheduling) and the other is something any well-configured system
> should require a user to specify so that the resources of the system can be
> managed.  In particular a job with only a pvmem limit can OOM any size node
> simply by spawning enough processes.
> 
> Setting both independently (should a user choose to do so) seems perfectly
> sensible.  But I agree with Gareth that it only makes sense to request
> vmem.  Now what vmem actually is and how it should be evaluated and limited is
> a whole other discussion ...

You are right - my misconception was that pvmem would mean "address
space per assigned processOR" and thus directly correspond to the procs
request. 

I guess the problem is threefold:
a) my perception of what torque does with pvmem and vmem requests;
b) what torque actually does with pvmem and vmem requests;
c) what torque should do with pvmem and vmem requests.

I spent some time figuring out problem (b). I am not sure whether
all of the following is right, so please correct me ...

In torque memory resources (vmem, pvmem, mem, pmem) are controlled
in two ways: 1) initially, by setting an appropriate rlimit before
the job is started; 2) through a control mechanism that has the mom(s)
periodically poll the job to determine its usage and terminate it
if it is over the limit.

I'll concentrate on vmem/pvmem (virtual memory/address space) for now.

A. pvmem
--------
1) When a job requests pvmem=X bytes, the mom sets RLIMIT_AS to X
   before starting the job (resmom/linux/mom_mach.c, mom_set_limits).
   If both vmem and pvmem are requested, RLIMIT_AS is set to the
   lesser of the two values (if pmem is specified as well, the limit
   is set to the pmem value); a sketch of this logic follows after
   this list.
2) While the job is running the mom(s) check periodically whether
   there are processes that belong to the job that use more memory
   than X (resmom/linux/mom_mach.c, mom_over_limit, overmem_proc).
   This appears to be absolutely pointless, since such processes
   cannot exist because of the rlimit set in (1). I.e., overmem_proc
   always returns false and could be eliminated.
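
To make (1) concrete, here is a minimal sketch of what I believe the
limit-setting logic amounts to. The function and the way the requested
values arrive here are made up for illustration; only setrlimit() and
RLIMIT_AS are the real interface, and the pmem case is left out:

    #include <sys/resource.h>

    /* Sketch only: apply the lesser of pvmem and vmem (values in bytes,
     * 0 meaning "not requested") as the per-process address space limit
     * before the job is started. */
    int apply_as_limit(unsigned long pvmem, unsigned long vmem)
    {
        struct rlimit rl;
        unsigned long limit;

        if (pvmem != 0)
            limit = (vmem != 0 && vmem < pvmem) ? vmem : pvmem;
        else if (vmem != 0)
            limit = vmem;
        else
            return 0;      /* nothing requested, leave the limit alone */

        rl.rlim_cur = rl.rlim_max = (rlim_t)limit;
        return setrlimit(RLIMIT_AS, &rl);  /* inherited by the job */
    }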

Since RLIMIT_AS sets a per-process limit, nothing stops a program
with a pvmem request from spawning more and more processes and
potentially running a node out of memory. I am assuming that a
scheduler (e.g. moab)
reserves X bytes of address space for each assigned processor (not
process!). Thus, there exists a discrepancy between what torque is
controlling (address space per process) and what the scheduler has
reserved for the job (address space per assigned core).
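
To make the per-process nature of this limit concrete, here is a toy
program (my own illustration, not torque code) in which every process
stays well under a 256mb RLIMIT_AS, yet the processes together use
roughly 1.6gb:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        struct rlimit rl;

        /* cap each individual process at 256mb of address space */
        rl.rlim_cur = rl.rlim_max = 256UL << 20;
        setrlimit(RLIMIT_AS, &rl);

        /* 8 children, each under its own cap, ~1.6gb in aggregate */
        for (int i = 0; i < 8; i++) {
            if (fork() == 0) {
                size_t n = 200UL << 20;
                char *p = malloc(n);
                if (p != NULL)
                    memset(p, 1, n);   /* actually touch the memory */
                sleep(10);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)         /* reap the children */
            ;
        return 0;
    }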

B. vmem
-------
1) When a job requests vmem=Y bytes, the mom sets RLIMIT_AS to Y
   before starting the job (resmom/linux/mom_mach.c, mom_set_limits).
   I suspect the idea is that in principle an Np-process job that uses
   Y bytes of address space could have one process that uses Y-eps
   bytes while the remaining Np-1 processes share the remaining eps
   bytes. In that respect setting RLIMIT_AS to Y (instead of Y/Np)
   is reasonable. However, how much memory does a scheduler reserve
   for such a job if not all processors are assigned on the same
   node? I am guessing that the scheduler reserves just Y/Np of
   address space per assigned processor, and consequently nodes can
   potentially be oversubscribed.
2) While the job is running the mom(s) periodically sum up the address
   space usage of all processes that belong to the job on the node
   the mom is running on (resmom/linux/mom_mach.c, mom_over_limit,
   mem_sum); a rough sketch of this kind of per-node summation follows
   after this list. However, I find nothing in the code where the mom
   superior would sum up these results from the sister moms; at least
   the mom_over_limit and mem_sum routines do not appear to do this.
   The consequence is that the control mechanism effectively only
   takes the address space used on the mom superior into account.
   I suspect that this is a bug/oversight.
   E.g., if you run a job with procs=2,vmem=3gb and each of the two
   processes ends up using 2gb of address space, then the job will get
   killed if the scheduler assigns two cores on the same node. However,
   the job will not get killed if the two processors are assigned on
   different nodes. Strangely enough, the reporting mechanism for,
   e.g., qstat -f does query all moms. There is a spurious comment
   "only enforce cpu time and memory usage" in mom_main.c. This isn't
   really correct since vmem does get enforced in some strange way.
   I can't make sense of this ...
   There exists another problem with the vmem control mechanism: it
   does not take shared memory into account. Let's assume that a job
   is submitted with nodes=1:ppn=2,vmem=3gb. Initially the job starts
   a single process that malloc's 2gb of memory. Then the job forks,
   and parent and child use the same 2gb of address space. Torque will
   add up the 2gb from parent and child and kill the job, because the
   mem_sum routine does not check whether memory is shared between
   processes. I do not know how such a check could be done, but the
   current mechanism is incorrect nevertheless. What do people use
   when requesting memory
   for a shared memory job?
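
For illustration, here is a rough sketch of the kind of per-node
summation I mean; this is not the actual mem_sum code, and it assumes
that the job's processes on a node can be identified by their session
id. It can neither see processes on sister nodes nor tell whether
pages are shared between processes:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Sum the virtual memory (VmSize) of every local process whose
     * session id matches the job's. Shared pages are counted once per
     * process; processes on other nodes are invisible here. */
    unsigned long long session_vmem(pid_t job_sid)
    {
        unsigned long long total = 0;
        struct dirent *de;
        DIR *proc = opendir("/proc");

        if (proc == NULL)
            return 0;

        while ((de = readdir(proc)) != NULL) {
            char path[64];
            unsigned long pages;
            FILE *fp;
            pid_t pid = (pid_t)atoi(de->d_name);

            if (pid <= 0 || getsid(pid) != job_sid)
                continue;              /* not numeric / not our session */

            snprintf(path, sizeof(path), "/proc/%d/statm", (int)pid);
            fp = fopen(path, "r");
            if (fp == NULL)
                continue;
            /* first field of statm: total program size in pages */
            if (fscanf(fp, "%lu", &pages) == 1)
                total += (unsigned long long)pages
                         * (unsigned long long)sysconf(_SC_PAGESIZE);
            fclose(fp);
        }
        closedir(proc);
        return total;
    }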

Neither the pvmem nor the vmem implementation makes sense to me,
in particular because I do not understand how either can work with
a scheduler that needs to reserve resources for a job. I would like
to see the following:
I. pvmem controls the amount of address space available to a job per
   assigned processor. I.e., the control mechanism should sum up the
   address space of all the processes the mom initially started and
   compare the total against pvmem times the number of processors
   assigned on that node. As far as I can tell this may not be too
   difficult to implement, since these processes should all have the
   same session id.
II. vmem controls the total amount of address space for the job, i.e.,
   the memory is added over all processes belonging to the job (not
   just those on the mom superior), and shared memory should not be
   double counted (one possible way to avoid the double counting is
   sketched after this list).
III. In the long run we may want to think about implementing different
   (p)vmem requests per requested processor ...
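
Regarding II, one conceivable way to avoid the double counting on
Linux (my speculation, not anything torque currently does) would be to
sum the "Pss:" lines from /proc/<pid>/smaps, which charge shared pages
proportionally to the processes that map them; note that this accounts
for resident pages rather than pure address space. A rough sketch:

    #include <stdio.h>
    #include <sys/types.h>

    /* Proportional set size of one process in kB; a control mechanism
     * would sum this over the job's pids instead of adding up VmSize. */
    unsigned long long pss_kb(pid_t pid)
    {
        char path[64], line[256];
        unsigned long long total = 0;
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%d/smaps", (int)pid);
        fp = fopen(path, "r");
        if (fp == NULL)
            return 0;

        while (fgets(line, sizeof(line), fp) != NULL) {
            unsigned long long kb;
            if (sscanf(line, "Pss: %llu kB", &kb) == 1)
                total += kb;   /* shared pages are charged fractionally */
        }
        fclose(fp);
        return total;
    }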

Cheers,
Martin

