[torqueusers] vmem and pvmem
siegert at sfu.ca
Mon Mar 5 21:00:35 MST 2012
On Sat, Feb 25, 2012 at 10:50:00AM +1100, David Singleton wrote:
> On 02/25/2012 09:00 AM, Martin Siegert wrote:
> > On Fri, Feb 24, 2012 at 11:19:37AM +0100, "Mgr. Šimon Tóth" wrote:
> >>> Core_req       vmem  pvmem  ulimit-v  RPT
> >>> =============================================
> >>> nodes=1:ppn=2  1gb   256mb  256mb     512mb
> >>> procs=2        1gb   256mb  256mb     1gb
> >>> nodes=1:ppn=2  1gb   4gb    1gb       4gb
> >>> procs=2        1gb   4gb    1gb       4gb
> >>> nodes=1:ppn=2  1gb   -      1gb       512mb
> >>> procs=2        1gb   -      1gb       1gb
> >>> So the ulimit value that influences whether a task can allocate
> >>> memory, is set as the lower of the vmem and pvmem values. That
> >>> makes some sense - at least more sense than taking the larger
> >>> value. What doesn't make sense is allowing pvmem to be higher
> >>> than vmem in the first place - in that case torque should probably
> >>> reject the job or 'fix' one of the settings but leaving it as is
> >>> might not be so bad, except for moab's behaviour (keep reading).
> >> No. The logic is as follows:
> >> * if pvmem (or pmem) is set
> >> then set the corresponding ulimit to pvmem (pmem) value
> >> * if pvmem (or pmem) isn't set
> >> then set the corresponding ulimit to vmem (mem) value
> >> Note that using pvmem is mostly pointless. On Linux this represents
> >> address space, not virtual memory.
> >> You can use vmem as virtual memory, but even that is extremely confusing.
> > I do not understand this comment. Both pvmem and vmem requests will
> > result in RLIMIT_AS getting set.
> I disagree with vmem setting RLIMIT_AS if that is what is happening.
> > When I submit a MPI job using, e.g., procs=N, why is requesting
> > pvmem=X mostly pointless? Shouldn't it be totally equivalent to
> > requesting vmem=X*N ?
> I think we have had the discussion of what procs means on a number of
> occasions (look for the thread "processes vs processors"). I believe "procs"
> (now) means (virtual) processORs (most commonly, they are cores). They are not
> processes. [In OpenPBS they were processes and only the UNICOS MOM supported
> that limit. At least in torque-3.0.2 procs is still not properly documented
> in pbs_resources* man pages.]
> pvmem sets some sort of memory limit per *process* so vmem should have nothing
> to do with procs and pvmem. pvmem and vmem are pretty much orthogonal. One is
> a voluntary limit the user places on their job processes (useless for actual
> resource scheduling) and the other is something any well-configured system
> should require a user to specify so that the resources of the system can be
> managed. In particular a job with only a pvmem limit can OOM any size node
> simply by spawning enough processes.
> Setting both independently (should a user choose to do so) seems perfectly
> sensible. But I agree with Gareth that it only makes sense to request
> vmem. Now what vmem actually is and how it should be evaluated and limited is
> a whole other discussion ...
You are right - my misconception was that pvmem would mean "address
space per assigned processOR" and thus directly correspond to the procs
request.
I guess the problem is threefold:
a) my perception of what torque does with pvmem and vmem requests;
b) what torque actually does with pvmem and vmem requests;
c) what torque should do with pvmem and vmem requests.
I spent some time figuring out problem (b). I am not sure whether
all of the following is right, so please correct me ...
In torque memory resources (vmem, pvmem, mem, pmem) are controlled
in two ways: 1) initially, by setting an appropriate rlimit before
the job is started; 2) through a control mechanism that has the mom(s)
periodically poll the job to determine its usage and terminate the
job if it is over the limit.
I'll concentrate on vmem/pvmem (virtual memory/address space) for now.
1) When a job requests pvmem=X bytes, the mom sets RLIMIT_AS to X
before starting the job (resmom/linux/mom_mach.c, mom_set_limits).
   If both vmem and pvmem are requested, RLIMIT_AS is set to the
   lesser of the two values. (If pmem is specified as well, the
   limit is set to the pmem value.)
2) While the job is running the mom(s) check periodically whether
there are processes that belong to the job that use more memory
than X (resmom/linux/mom_mach.c, mom_over_limit, overmem_proc).
   This appears to be absolutely pointless since such processes
   cannot exist because of the rlimit set in (1). I.e., overmem_proc
   always returns false and can be eliminated.
Since RLIMIT_AS sets a per-process limit, nothing stops a job with
a pvmem request from spawning more and more processes and potentially
running a node out of memory. I am assuming that a scheduler (e.g. moab)
reserves X bytes of address space for each assigned processor (not
process!). Thus, there exists a discrepancy between what torque is
controlling (address space per process) and what the scheduler has
reserved for the job (address space per assigned core).
1) When a job requests vmem=Y bytes, the mom sets RLIMIT_AS to Y
before starting the job (resmom/linux/mom_mach.c, mom_set_limits).
   I suspect the idea is that in principle an Np-process job that uses Y bytes
of address space could have one process that uses Y-eps bytes of
memory while the remaining Np-1 processes share the remaining eps
bytes. In that respect setting RLIMIT_AS to Y (instead of Y/Np)
is reasonable. However, how much memory is a scheduler reserving
for such a job assuming that not all processors are assigned on
the same node? I am guessing that the scheduler reserves just
Y/Np address space and consequently there is a potential that
nodes are oversubscribed.
2) While the job is running the mom(s) periodically sum up the address
space usage of all processes that belong to the job on the node
the mom is running on (resmom/linux/mom_mach.c, mom_over_limit,
mem_sum). However, there is nothing in the code where the mom
superior would sum up these results from each of the sister moms.
At least I cannot find anything in the mom_over_limit and mem_sum
routines that would do this. The consequence is that the control
mechanism effectively only takes the address space used on the
mom superior into account. I suspect that this is a bug/oversight.
E.g., if you run a job with procs=2, vmem=3gb and each of the two
processes ends up using 2gb of address space, then the job will get
killed if the scheduler assigns two cores on the same node. However,
the job will not get killed if two processors get assigned on different
nodes. Strangely enough the reporting mechanism for, e.g., qstat -f
does query all moms. There is a spurious comment "only enforce cpu
time and memory usage" in mom_main.c. This isn't really correct
since vmem does get enforced in some strange way. I can't make
sense of this ...
There exists another problem with the vmem control mechanism: it
does not take shared memory into account. Let's assume that a job
is submitted with nodes=1:ppn=2,vmem=3gb. Initially the job starts
a single process that malloc's 2gb of memory. Then the job forks and
parent and child use the same 2gb of address space. Torque will add
up the 2gb from parent and child and kill the job because the mem_sum
routine does not check whether memory is shared between processes.
I do not know how this could be done, but the current mechanism is
incorrect nevertheless. What do people use when requesting memory
for a shared memory job?
As far as I can tell, neither the pvmem nor the vmem implementation
makes sense, particularly since I do not understand how either can
work with a scheduler that needs to reserve resources for a job.
I would like to see the following:
I. pvmem controls the amount of address space available to a job per
assigned processor. I.e., the control process should sum up the
address space of all the processes that were started by the mom
initially. As far as I can tell this may not be too difficult to
implement, since these processes should all have the same session id.
II. vmem controls the total amount of address space for the job, i.e.,
the memory is added over all processes belonging to the job (not
    just on the mom superior). And shared memory should not be double
    counted.
III. In the long run we may want to think about implementing different
(p)vmem requests per requested processor ...