Bugzilla – Bug 93
Resource management semantics of Torque need to be well defined
Last modified: 2010-12-07 09:26:54 MST
Currently Torque includes inconsistent resource management semantics. These semantics need to be redefined.

* External Schedulers *
From what I have been told (I only work with plain Torque), external schedulers (Moab, Maui) send a very specific nodespec, or directly an exechost list, in their run requests. If this is not true, then we need to consider what semantics external schedulers expect from Torque. If this is true, then these schedulers can be safely ignored (as far as resource semantics go).

* Process (ppn) semantics *
Process semantics should be dumped completely. The only thing they are useful for right now is limiting vmem in a per-process manner. The number of processes isn't limited by Torque (not 100% sure here), and with the liberal approach towards forking in most Linux software, limiting it wouldn't be a good idea either.

* Per-job, per-node, per-process resource *
Even when the process semantics are dumped, we still need to distinguish between per-node and per-job resources. For example, mem should definitely be a per-node resource, while the number of matlab licenses should definitely be a per-job resource.

* Configurable with pre-set defaults or strict *
I would definitely like a configurable approach. Setting flags in the resource definition (as done in my bug 67) is probably not the best approach (so we need to come up with something more sane). In both cases we need to define a set of fully internally supported resources.

This is a list of resources I consider essential:
- ncpus
- mem
- vmem
- GPU
- walltime
- cputime

Plus we need some generic resources that are checked (i.e. if a job requires 4 kitchen-sinks and a node only has 2 available, then the job cannot be run), but don't have any special semantics.

Support without semantics:
- generic per-node counted resource (counted/enforced only on server)
- generic per-job counted resource (counted/enforced only on server)

* Cgroups - Linux specific *
I have been digging through the cgroups docs, and the good thing is that we can replace a lot of the Linux stuff with cgroups that should work reliably.

Stuff that cgroups can do:
- memory (mem, vmem, oom killer configuration)
- cpusets
- devices (limiting access) - should work well for GPUs or any generic HW requiring dedicated access
- frozen containers
- accounting
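To make the "generic counted resource" idea above concrete, here is a minimal sketch of a server-side check. It is not Torque code: the names `Node`, `node_can_run`, `job_can_run` and the resource names `kitchen_sink` and `matlab` are hypothetical and only illustrate the per-node versus per-job distinction.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Per-node counted resources still available on this node (hypothetical names).
    available: dict = field(default_factory=dict)


def node_can_run(node, per_node_request):
    """Return True if the node has at least the requested amount of every per-node resource."""
    return all(node.available.get(res, 0) >= amount
               for res, amount in per_node_request.items())


def job_can_run(nodes, per_node_request, per_job_request, server_pool):
    """Check per-node requests on every node and per-job requests against one server-wide pool."""
    per_job_ok = all(server_pool.get(res, 0) >= amount
                     for res, amount in per_job_request.items())
    return per_job_ok and all(node_can_run(n, per_node_request) for n in nodes)


node = Node("node01", {"kitchen_sink": 2})
print(node_can_run(node, {"kitchen_sink": 4}))  # False: the node only has 2 available
print(job_can_run([node], {"kitchen_sink": 1}, {"matlab": 1}, {"matlab": 1}))  # True
```

The per-node amount is checked once per allocated node, while the per-job amount is checked once for the whole job, which is the only semantic difference this sketch assumes.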
External schedulers - I think you're right for both Moab and Maui, they both set exec_host.

PPN = processors per node (according to the manual page), really virtual processors, as you can overcommit if you are not using cpusets. I've seen plenty of commercial software out there that uses them, so I don't think it can go away. The pvmem limits which you mention are vital to us.

Different resource limits - I think the current per-process and per-job limits make enough sense; they're easy for users to understand. The only real issue is that you cannot set a proactively enforced (i.e. malloc fails) limit across a job as a whole. But that's enforced by the scheduler anyway (at least with Maui and Moab).

Resources we need:
- pvmem
- procs and tpn
- walltime
- software
- nodes and ppn (for commercial software which supports PBS)

Cgroups - I reckon it's a good plan for the future, but we need to realise that it's not going to really arrive for most clusters until RHEL6/CentOS6 starts getting deployed. Also, you cannot have both cpusets and cgroups mounted at the same time, so the current code needs to be refactored/abstracted to be able to cope with either one being present.

It cannot depend on a feature of cgroups being present but should give you the benefits if it is.
> External schedulers - I think you're right for both Moab and Maui, they both
> set exec_host.

That would be great.

> PPN = processors per node (according to manual page), really virtual processors
> as you can overcommit if you are not using cpusets. I've seen plenty of
> commercial software out there that uses them, so I don't think it can go away.
> The pvmem limits which you mention are vital to us.

Well, that's the problem: the manual page says processors per node, but that's not how Torque works (this is exactly the reason why I created this bug). They are processes per node. I'm not saying to get rid of ppn, but to get rid of the process semantics, so that ppn will actually be processors, not processes. pvmem can actually stay, although I think pmem and pvmem can be easily superseded by mem and vmem. Plus, when you request -l nodes=2:ppn=2:pvmem=2G, how much memory do you expect to get? In the current Torque semantics it is 2*2*2G.

> Different resource limits - I think the current per process and per job limits
> make enough sense, it's easy for users to understand. The only real issue is
> that you cannot set a proactively enforced (i.e. malloc fails) limit across a
> job as a whole. But that's enforced by the scheduler anyway (at least with
> Maui and Moab).

The issue is that it is enforced internally by the schedulers. My target is to make all this work even with qrun. That implies that basic resources like mem, cpus, etc. must have well-defined semantics inside Torque itself.

> Cgroups - I reckon it's a good plan for the future but we need to realise that
> it's not going to really arrive for most clusters until RHEL6/CentOS6 starts
> getting deployed. Also you cannot have both cpusets and cgroups mounted at the
> same time so the current code needs to be refactored/abstracted to be able to
> cope with either one being present.
>
> It cannot depend on a feature of cgroups being present but should give you the
> benefits if it is.

Actually my idea was to create a new cgroups platform (new folder in src/resmom/).
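The 2*2*2G arithmetic for -l nodes=2:ppn=2:pvmem=2G can be spelled out as a small worked example. This is only a sketch of the multiplication described in this comment, not Torque's parser; the helper names are hypothetical and only the "G" suffix is handled.

```python
def parse_size_g(text):
    """Parse a size such as '2G' into a whole number of G; other suffixes are out of scope here."""
    if not text.upper().endswith("G"):
        raise ValueError("only sizes like '2G' are supported in this sketch")
    return int(text[:-1])


def total_pvmem_g(nodespec):
    """Total implied by nodes=N:ppn=P:pvmem=S when pvmem is counted once per ppn slot."""
    fields = dict(item.split("=", 1) for item in nodespec.split(":"))
    nodes = int(fields["nodes"])
    ppn = int(fields.get("ppn", 1))
    return nodes * ppn * parse_size_g(fields["pvmem"])


print(total_pvmem_g("nodes=2:ppn=2:pvmem=2G"))  # 8, i.e. 2 nodes * 2 ppn * 2G
```

If ppn were read as cores rather than processes, a user asking for pvmem=2G might reasonably expect 2G per node instead, which is the ambiguity being raised.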
> Well, that's the problem, then manual page says processors per node, but that's
> not how Torque works (this is exactly the reason why I created this bug). They
> are processes per node. I'm not saying to get rid of ppn, but to get rid of the
> processes semantics, therefore ppn will be actually processors not processes.
> pvmem can actually stay, although I think pmem and pvmem can be easily
> superseded by mem and vmem.

I understand the frustration with ppn not really meaning processors per node. However, the current behavior of ppn is widely used and expected. We need to live with this. Changing this behavior will break things for too many people.
(In reply to comment #3)
> I understand the frustration with ppn not really meaning processors per node.
> However, the current behavior of ppn is widely used and expected. We need to
> live with this. Changing this behavior will break too many people.

In what way are they using it as processes? Are they requesting that the MOM call setrlimit(RLIMIT_NPROC)? Are they killing jobs if jobs are detected as having more than that many processes running on a node? None of these make any sense whatsoever (unless some large forkbomb limit is applied - but that should be a system limit, not a user resource request).

Is the ppn value being used to impose pvmem or pmem limits somehow? I don't see that in the Torque code. By external schedulers? How?

I suspect "processes per node" only really appears in flawed and misleading documentation, not in real code.
(In reply to comment #4)
> In what way are they using it as processes? Are they requesting the MOM call
> setrlimit(RLIMIT_NPROC)? Are they killing jobs if jobs are detected as having
> more than that many processes running on a node? None of these make any sense
> whatsoever (unless some large forkbomb limit is applied - but that should be a
> system limit, not a user resource request).
>
> Is the ppn value being used to impose pvmem or pmem limits some how? I dont see
> that in the Torque code? By external schedulers? How?
>
> I suspect "processes per node" only really appears in flawed and misleading
> documentation, not in real code.

Processes per node is often how it is explained, although you are right, it isn't restricted in any way to actually limit the number of processes that can be run. It may have originally been intended to be processors per node, but now almost all processors intended for computing have multiple cores, making processors per node completely ambiguous and therefore not very useful.

However, it is in the code in a few ways:

ppn is the number of times that the nodename will appear in the $PBS_NODEFILE. This is intended to be read by the MPI scripts of the program, which then start that many processes. There is nothing in TORQUE that stops the scripts from spawning more processes, though.

ppn is left completely configurable per node, and so the notion that it is tied to the actual hardware is false. Often in production systems, ppn becomes cores per node, because that's how many the system admin wants for optimal use.

The fact of the matter is that ppn hasn't been clearly defined over time, and what it has become in practice is probably best described as processes per node. At any rate, changing this behavior would greatly disrupt life for *very* many TORQUE users.
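The $PBS_NODEFILE behaviour described above (each node name repeated ppn times) can be illustrated with a short sketch. The functions `build_nodefile` and `slots_per_host` and the host names are hypothetical; this mimics the file layout described in this comment and is not taken from the TORQUE source.

```python
from collections import Counter


def build_nodefile(allocation):
    """Render node file text: each host name repeated once per ppn slot it was given."""
    lines = []
    for host, ppn in allocation:
        lines.extend([host] * ppn)
    return "\n".join(lines) + "\n"


def slots_per_host(nodefile_text):
    """Count how often each host appears, which is what a launcher reading the file would see."""
    return Counter(line.strip() for line in nodefile_text.splitlines() if line.strip())


# A request like nodes=2:ppn=2, assuming it lands on two distinct hosts.
text = build_nodefile([("node01", 2), ("node02", 2)])
print(text, end="")
print(dict(slots_per_host(text)))
```

Nothing in this layout limits how many processes a script actually starts; the file only tells the launcher how many entries each host received.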
(In reply to comment #5)
> Processes per node is often how it is explained, although you are right, it
> isn't restricted in any way to actually limit the number of processes that can
> be run. It may have originally been intended to be processors per node, but now
> almost all processors intended for computing have multiple cores, making
> processors per node completely ambiguous and therefore not very useful.
>
> However, it is in the code in a few ways:
>
> ppn is the number of times that nodename will appear in the $PBS_NODEFILE. This
> is intended to be read by the mpi scripts on the program to then make that many
> processes. There is nothing in TORQUE that stops the scripts from spawning more
> processes though.
>
> ppn is left completely configurable per node, and so the notion that it is tied
> to the actual hardware is false. Often in production systems, ppn becomes cores
> per node, because that's how many the system admin wants for optimal use.
>
> The fact of the matter is that ppn hasn't been clearly defined over time, and
> what it has become in practice is probably best described as processes per
> node. At any rate, changing this behavior would greatly disrupt life for *very*
> many TORQUE users.

As Chris Samuel pointed out, the "p" in "ppn" meant "virtual processors". A "virtual processor" can mean a core - for most of us that is exactly what it means. It can mean an "execution slot" for those sites that set node np greater than the number of physical cores (or hyperthread contexts). The important thing is that it is a characteristic of the hardware/system/site. It is not a property of the job. The number of processes in a job is a property of a job. In general there is no alignment.

If I was to run a 16-thread OpenMP job, what value of ppn do I use? The OpenMP app will have 1 process. But then there will be 2 shells in the job, so it's likely to be 3 processes. So ppn=3? What I actually want is 16 pieces of hardware that can each run a thread without conflict (as much as possible), i.e. I want 16 virtual processors.

Yes, the use of the term "processor" needs to be spelt out as above. But at least it can be made technically accurate. The use of the term "process" cannot, unless you want to turn it into a property of the system.
(In reply to comment #6)
> As Chris Samuel pointed out, the "p" in "ppn" meant "virtual processors". A
> "virtual processor" can mean a core - for most us that is exactly what it
> means. It can mean an "execution slot" for those sites that set node np
> greater than the number of physical cores (or hyperthread contexts). The
> important thing is that it is a characteristic of the hardware/system/site. It
> is not a property of the job. The number of processes in a job is a property
> of a job. In general there is no alignment.
>
> If I was to run a 16 thread OpenMP job, what value of ppn do I use? The OpenMP
> app will have 1 process. But then there will be 2 shells in the job so its
> likely to be 3 processes. So ppn=3 ? What I actually want is 16 bits of
> hardware that each can run a thread without conflict (as much as possible),
> i.e. I want 16 virtual processors.
>
> Yes, the use of the term "processor" needs to be spelt out as above. But at
> least it can be made technically accurate. The use of the term "process" cannot
> unless you want to turn it into a property of the system.

I'm not sure what change Simon wanted but, just to be clear, this looks like a purely documentation issue to me. The only thing that has changed since the "good ol' PBS days" is that someone started documenting "virtual processors" as "processes", which is very confusing. As far as I am concerned the behaviour is OK, just the terminology is totally wrong. Simon will have to explain what he sees as the problem.

Note: I am not a Torque user, merely someone who would not like to see confusion amongst users when using variants of PBS.
(In reply to comment #6)
> As Chris Samuel pointed out, the "p" in "ppn" meant "virtual processors". A
> "virtual processor" can mean a core - for most us that is exactly what it
> means. It can mean an "execution slot" for those sites that set node np
> greater than the number of physical cores (or hyperthread contexts). The
> important thing is that it is a characteristic of the hardware/system/site. It
> is not a property of the job. The number of processes in a job is a property
> of a job. In general there is no alignment.
>
> If I was to run a 16 thread OpenMP job, what value of ppn do I use? The OpenMP
> app will have 1 process. But then there will be 2 shells in the job so its
> likely to be 3 processes. So ppn=3 ? What I actually want is 16 bits of
> hardware that each can run a thread without conflict (as much as possible),
> i.e. I want 16 virtual processors.
>
> Yes, the use of the term "processor" needs to be spelt out as above. But at
> least it can be made technically accurate. The use of the term "process" cannot
> unless you want to turn it into a property of the system.

Maybe it should be called vppn. Believe me, I understand the frustration with the ambiguity of the name. In essence it comes down to the number of "ppns" that will be allowed to be scheduled on the node. Come up with another name for ppn that adequately represents the scheduling limit imposed by the attribute and we could use that in the documentation. But I think the term ppn and its syntax are here to stay.
(In reply to comment #7)
> I'm not sure what change Simon wanted but, just to be clear, this looks like a
> purely documentation issue to me. The only thing that has changed since the
> "good ol' PBS days" is that someone started documenting "virtual processors" as
> "processes" which is very confusing. As far as I am concerned the behaviour is
> OK, just the terminology is totally wrong. Simon will have to explain what he
> sees as the problem.
>
> Note: I am not a Torque user, merely someone who would not like to see
> confusion amongst users when using variants of PBS.

It would be awesome if it were just a documentation issue. In particular, the node interprets ppn as processes. If you look into the code of the server, it doesn't really make any difference, but it still creates a sub-node for each process.

One problem with using ppn as cpus/cores is that when you request pmem or pvmem or panything you will get ppn*amount, which can be counterintuitive.

I personally don't think that per-process resources make much sense these days (since the number of processes isn't limited by Torque anyway). That includes per-process resources. But again, either way is OK for me; I just think we should define which way it works.
If you are using cpusets then it is processors per node in that your job is constrained to just the cpus you requested.
(In reply to comment #10)
> If you are using cpusets then it is processors per node in that your job is
> constrained to just the cpus you requested.

Yeah. I'm not talking about what is achievable with the current semantics. Sure, you can do pretty much everything with the current semantics (and that is something that has to be maintained). This is more about cleanup and clarification.
We have made at least the first step in clearing up the confusion around the meaning of ppn. We have updated the documentation in a couple of places.

http://www.clusterresources.com/products/torque/docs/1.5nodeconfig.shtml

Also in section 2.1.2 under nodes and ppn:
http://www.clusterresources.com/products/torque/docs/2.1jobsubmission.shtml

I took my definition from David Singleton's comments. They seemed to be the best explanation.
(In reply to comment #5)
> Processes per node is often how it is explained,
...
> The fact of the matter is that ppn hasn't been clearly defined over time, and
> what it has become in practice is probably best described as processes per
> node.

Describing it as "processes per node" is very misleading and completely inaccurate. Take for example a multi-threaded program. I routinely run multi-threaded code on our cluster. We have 32 cores per node, and if I run a _single process_ that uses 32 threads, I request ppn=32. If that meant _processes_ I would request ppn=1 because, after all, my multi-threaded program is still a single process. It is, however, using multiple cores.

Virtual processors per node is the correct definition of ppn - the number of virtual processors will typically be set to the total number of cores on a node. Redefining it as processes per node will lead to problems.
(In reply to comment #13)
> Describing it as "processes per node" is very misleading and completely
> inaccurate. Take for example a multi-threaded program. I routinely run
> multi-threaded code on our cluster. We have 32 cores per node, and if I run a
> _single process_ that uses 32 threads, I request ppn=32. If that meant
> _processes_ I would request ppn=1 because, after all, my mult-threaded program
> is still a single process. It is, however, using multiple-cores.
>
> virtual processor per node is the correct definition of ppn - the number of
> virtual processors will typically be set to the total number of cores on a
> node. redefining it as processes per node will lead to problems.

Glen, I double checked the documentation online and I did use the phrase virtual processor. I tried to be careful not to use the word process or processes.
(In reply to comment #12)
> We have made at least the first step in clearing up the confusion around the
> meaning of ppn. We have updated the documentation in a couple of places.
>
> http://www.clusterresources.com/products/torque/docs/1.5nodeconfig.shtml
>
> Also in section 2.1.2 under nodes and ppn:
> http://www.clusterresources.com/products/torque/docs/2.1jobsubmission.shtml
>
> I took my definition from David Singleton's comments. They seemed to be the
> best explaination.

I realised later that what I wrote was not sufficiently precise and it has turned into this incorrect line in http://www.clusterresources.com/products/torque/docs/2.1jobsubmission.shtml :

"The ppn value is a characteristic of the hardware, system, and site, and its value is to be determined by the administrator."

np is a resource attribute of the system and *its* value is determined by the administrator. ppn is a user request (determined by the user) for a quantity of that system resource attribute. I think you can just leave out that line.
(In reply to comment #12)
>
> Also in section 2.1.2 under nodes and ppn:
> http://www.clusterresources.com/products/torque/docs/2.1jobsubmission.shtml

Hopefully this:

"By default, the node resource is mapped to a virtual node (that is, directly to a processor, not a full physical compute node). "

is not true. Hopefully, this is more true (it's how our scheduler works at least):

"By default, the node resource is mapped to a virtual node (that is, multiple virtual nodes (from one or more jobs) may be allocated to the same physical node (host) provided that all other resource requests can be satisfied by the shared physical host). "
(In reply to comment #16)
> Hopefully this:
>
> "By default, the node resource is mapped to a virtual node (that is, directly
> to a processor, not a full physical compute node). "
>
> is not true. Hopefully, this is more true (its how our scheduler works at
> least):
>
> "By default, the node resource is mapped to a virtual node (that is, multiple
> virtual nodes (from one or more jobs) may be allocated to the same physical
> node (host) provided that all other resource requests can be satisfied by the
> shared physical host). "

I started to add your corrections to the documentation when I realized we have another item we need to define. That is the host. In the context of nodes as a resource, a node is not the same as a host. When we are configuring the nodes file we are actually configuring execution hosts. When we are requesting nodes we are requesting parts of each host.

Please add any comments you think are appropriate.
> provided that all other resource requests can be satisfied by the
> shared physical host). "

I would skip this part, since there are no other resource requests in Torque.

I would say that the wording is confusing. What about calling it a "slot"? The administrator defines how many slots the machine has, and each job can request multiple slots on multiple machines. This safely throws away any implied semantics.
I posted this in the torquedev thread generated by bugzilla for this bug:

2010/12/6 Michel Béland <michel.beland@rqchp.qc.ca>:
> Later, they introduced -lselect and deprecated -lnodes altogether. Now
> one can ask for -lselect=10:ncpus=8:mpiprocs=2:ompthread=4 to get the
> same result, if I remember correctly, but I think that I liked ppn and
> cpp better...

I remember there was talk from some of the TORQUE developers at Adaptive about adding a "select" statement to TORQUE. Whatever happened to that? I think it would be great if we could add in a select feature that is compatible (or at least mostly compatible) with the PBS Pro select. Maybe Šimon Tóth's work could get us partially there.
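For readers unfamiliar with the select syntax quoted above, here is a minimal sketch that only splits such a string into its parts. It assumes chunks of the form count:key=value:... joined by "+", and it attaches no scheduling meaning to any key; `parse_select` is a hypothetical helper, not an existing TORQUE or PBS Pro function.

```python
def parse_select(spec):
    """Split 'N:key=value:...' chunks (joined by '+') into a list of dicts."""
    chunks = []
    for chunk in spec.split("+"):
        parts = chunk.split(":")
        # A leading bare integer is taken as the chunk count; default to 1 if absent.
        count = 1
        if parts and parts[0].isdigit():
            count = int(parts.pop(0))
        resources = dict(p.split("=", 1) for p in parts)
        chunks.append({"count": count, "resources": resources})
    return chunks


# The request string quoted in the message above, without the -lselect= prefix.
print(parse_select("10:ncpus=8:mpiprocs=2:ompthread=4"))
```

Values are kept as strings on purpose, since how each resource should be interpreted is exactly what this bug is trying to pin down.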
(In reply to comment #18)
> > provided that all other resource requests can be satisfied by the
> > shared physical host). "
>
> I would skip this part, since there are no other resource requests in Torque.
>
> I would say that the wording is confusing. What about calling it a "slot".
> Administrator defines how many slots the machine has and each job can request
> multiple slots on multiple machines. This safely throws away any implied
> semantics.

I like the idea of calling this an execution slot. It is generic but also descriptive of the purpose of np in the nodes file and ppn in a job request.
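The "execution slot" reading of np and ppn can be sketched as simple bookkeeping: the administrator sets np per host, and a job asks for ppn slots on each of its nodes. `SlotPool` is a hypothetical illustration, not TORQUE's allocator, and for simplicity it places each requested node on a distinct host.

```python
class SlotPool:
    """Tracks free execution slots per host: np is set by the admin, ppn is requested by jobs."""

    def __init__(self, np_by_host):
        self.free = dict(np_by_host)

    def allocate(self, nodes, ppn):
        """Reserve ppn slots on each of `nodes` distinct hosts, or return None if that is impossible."""
        chosen = [host for host, free in self.free.items() if free >= ppn][:nodes]
        if len(chosen) < nodes:
            return None
        for host in chosen:
            self.free[host] -= ppn
        return chosen


pool = SlotPool({"node01": 8, "node02": 8})   # np=8 on both hosts
print(pool.allocate(nodes=2, ppn=4))          # ['node01', 'node02']
print(pool.allocate(nodes=1, ppn=8))          # None: only 4 slots are left on each host
```

Under this reading a slot carries no implied meaning about processes, cores, or threads; it is only a count that the server decrements.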
(In reply to comment #19)
> I posted this in the torquedev thread generated by bugzilla for this bug:
>
> 2010/12/6 Michel Béland <michel.beland@rqchp.qc.ca>:
> > Later, they introduced -lselect and deprecated -lnodes altogether. Now
> > one can ask for -lselect=10:ncpus=8:mpiprocs=2:ompthread=4 to get the
> > same result, if I remember correctly, but I think that I liked ppn and
> > cpp better...
>
> I remember there was talk from some of the TORQUE developers at
> adaptive about adding a "select" statement to TORQUE. Whatever
> happened to that? I think it would be great if we could add in a
> select feature that is compatible (or at least mostly compatible) with
> the PBS Pro select. Maybe Šimon Tóth's work could get us partially
> there.

I was reminded of this at SC'10. Over the summer, when we started looking at putting select in the TORQUE resource manager, we realized this would actually be best handled by the scheduler. Even so, there seems to be a need to have some basic support for select at the RM level.