Bug 67 - Support for counted resources on nodes
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 2.4.x
Hardware: PC Linux
Importance: P5 enhancement
Assigned To: Glen
Reported: 2010-06-30 08:58 MDT by Simon Toth
Modified: 2010-09-07 04:29 MDT
CC: 4 users

Attachments
Patch (41.29 KB, patch) - 2010-06-30 08:59 MDT, Simon Toth
Server logic (2.00 KB, patch) - 2010-07-01 08:01 MDT, Simon Toth
Doc (1.34 KB, text/plain) - 2010-07-04 15:58 MDT, Simon Toth
Server restart logic (used resources are correctly re-added to nodes) (6.29 KB, patch) - 2010-07-13 03:38 MDT, Simon Toth
Resources checking logic (18.39 KB, patch) - 2010-07-19 12:02 MDT, Simon Toth
Total resources support (include resources in nodespec when checking server/queue limits) (8.29 KB, patch) - 2010-07-31 20:45 MDT, Simon Toth
Full patch based on top of 2.4 fixes head (64.88 KB, patch) - 2010-07-31 20:47 MDT, Simon Toth

Description Simon Toth 2010-06-30 08:58:46 MDT
I finally managed to reserve some time to tear this patch out of my
development branch.

The feature itself is stable, but I'm not sure the patch is entirely complete
(some tiny bugfix might have slipped past me).

What it does:

support on nodes:
  resources_total.resource = value (read-write, can be taken from node)
  resources_used.resource = value (read-only, counted on server)

There are two server attributes that control what resources should be taken
from the node reports.

resources_to_store: list of resources that should be stored
resources_mappings: list of mappings for resources (old=new)

For example, if you want to store reported memory (physmem) and store it as
the pmem resource, you could do this:

resources_to_store = physmem
resources_mappings = physmem=pmem
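
In qmgr this would presumably look like the following (assuming these two
attributes follow the usual server-attribute syntax):

  set server resources_to_store = physmem
  set server resources_mappings = physmem=pmem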

Resources set manually (using qmgr or the nodes file) are never overwritten.
To have them taken from node reports again, the resources have to be unset
first.

The server calculates the used resources.

The server does not currently prevent jobs requesting more resources than are
available from running. This is something I still need to implement on the
server (right now I handle it only on the scheduler's side). I will post the
patch as soon as it is done. It should just require patching the hasprop()
function.
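
To illustrate the kind of check this needs, a minimal sketch in C
(illustrative only, not the patch code; the struct and function names here
are made up):

  /* Sketch of the availability check described above: a node can only
   * accept a request if used + requested stays within the total. */
  #include <stdbool.h>

  struct counted_resc {
      long total; /* resources_total.X (set manually or taken from reports) */
      long used;  /* resources_used.X (counted by the server) */
  };

  static bool node_can_fit(const struct counted_resc *r, long requested)
  {
      return r->used + requested <= r->total;
  }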
Comment 1 Simon Toth 2010-06-30 08:59:03 MDT
Created an attachment (id=35)
Patch
Comment 2 Simon Toth 2010-06-30 09:00:45 MDT
Oh, I almost forgot. The patch contains some older stuff that I already
posted (and that was not yet included). I was too lazy to filter it out...
sorry about that.
Comment 3 Garrick Staples 2010-06-30 13:42:37 MDT
I'm certainly just being dense, but I don't yet understand the purpose of this
patch. What problem is being solved? Why would I use this feature?
Comment 4 Simon Toth 2010-06-30 18:29:49 MDT
(In reply to comment #3)
> I'm certainly just being dense, but I don't yet understand the purpose of this
> patch. What problem is being solved? Why would I use this feature?

I can certainly imagine a cluster that would be fine with just counting
cpus/processes. We need other resources like memory, disk space...
Comment 5 Simon Toth 2010-07-01 08:01:41 MDT
Created an attachment (id=37)
Server logic

Server logic for checking resources when handling run requests.
Comment 6 Ken Nielson 2010-07-01 10:19:10 MDT
(In reply to comment #3)
> I'm certainly just being dense, but I don't yet understand the purpose of this
> patch. What problem is being solved? Why would I use this feature?

Simon,

Would you please attach a document which describes the new attributes, how they
are used and what they do for the user.
Comment 7 Simon Toth 2010-07-04 15:58:26 MDT
Created an attachment (id=38)
Doc

Here's the doc; it's short, but it should describe the feature well enough.
Comment 8 Chris Samuel 2010-07-04 18:24:21 MDT
Er, isn't that just describing what (p)mem, (p)vmem do already?
Comment 9 Simon Toth 2010-07-05 06:36:10 MDT
(In reply to comment #8)
> Er, isn't that just describing what (p)mem, (p)vmem do already?

Not really. If you are talking about the pbs_mom part, then pbs_mom will
enforce system limits for resources that are enforceable this way.

This of course only works for resources like cpu cores, walltime, memory,
etc...

If you have a special machine with two co-processor cards, you can't have
these cards as a resource (with a total value of 2).

Please see my reply on the Torque dev list.
Comment 10 Ken Nielson 2010-07-05 10:38:03 MDT
Maybe vmem is not the best resource to use for your example. We are already
enforcing limits on memory; we do not need to set them manually.

Is this perhaps more useful with special hardware like processor cards?

Ken
Comment 11 Ken Nielson 2010-07-06 14:57:34 MDT
I think I finally understand what you are trying to do. If select is to work,
we need to be able to distribute resources such as mem, which are currently
only job-wide resources rather than node-level ones (contrary to the
documentation). Let me qualify that: it is a job-wide resource on the server.

However, I think it would be better to just store the resource information
directly from each node and not create arbitrary information the way we do with
np. We do not even need to have a server parameter. We can change the pbsnode
structure to store these values.

Ken
Comment 12 Simon Toth 2010-07-06 16:03:32 MDT
(In reply to comment #11)
> I think I finally understand what you are trying to do. If select is to work we
> need to be able to distribute resources such as mem which currently are only
> job wide resources instead of node level (contrary to the documentation). Let
> me qualify that. It is a job wide resource on the server.
> 
> However, I think it would be better to just store the resource information
> directly from each node and not create arbitrary information the way we do with
> np. We do not even need to have a server parameter. We can change the pbsnode
> structure to store these values.

OK, I probably need to brush up on my English :-)

Yes, that is exactly it. You can only request resources for the whole job in
Torque. That's OK for resources that have no location (like licenses), but
not so much for resources that belong to nodes.

The patch is a modification of the pbsnode structure (two resource
attributes). The server parameters are just a convenience; the whole point of
the patch is to be able to specify resources that are not reported by the
nodes. This is extremely useful because it allows admins to very quickly
specify new counted resources (for example: disk space, special hardware...).
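
Roughly, the idea on the node structure looks like this (a simplified sketch
with made-up types; the real patch plugs into TORQUE's attribute/resource
machinery on struct pbsnode):

  #include <stddef.h>

  /* One counted resource, e.g. "ncpus", "mem", or "gpus". */
  struct resc_entry {
      char *name;
      long  value;
  };

  /* Sketch of the two new per-node resource attributes. */
  struct node_counted_resources {
      struct resc_entry *resources_total; /* read-write, may come from node */
      size_t             n_total;
      struct resc_entry *resources_used;  /* read-only, counted on server */
      size_t             n_used;
  };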
Comment 13 Ken Nielson 2010-07-06 16:33:01 MDT
(In reply to comment #12)
> Yes, that is exactly it. You can only request resources for the whole job in
> Torque. That's OK for resources that have no location (like licenses), but
> not so much for resources that belong to nodes.
> 
> The patch is a modification of the pbsnode structure (two resource
> attributes). The server parameters are just a convenience; the whole point of
> the patch is to be able to specify resources that are not reported by the
> nodes. This is extremely useful because it allows admins to very quickly
> specify new counted resources (for example: disk space, special hardware...).

I see the utility in administrators having the ability to just set values for
resources which may not be reported by the nodes. But how do you attach a value
to a specific node with this syntax?

Ken
Comment 14 Simon Toth 2010-07-07 04:11:21 MDT
(In reply to comment #13)
> I see the utility in administrators having the ability to just set values
> for resources which may not be reported by the nodes. But how do you attach
> a value to a specific node with this syntax?

Using qmgr:

set node (some_node) resources_total.(some_resource) = (some_value)

example:

set node clusterNode123 resources_total.ncpus=4
Comment 15 Simon Toth 2010-07-13 03:38:54 MDT
Created an attachment (id=39)
Server restart logic (used resources are correctly re-added to nodes)

Another part of the server logic: when the server is restarted, used
resources are correctly re-added to the nodes.
Comment 16 Simon Toth 2010-07-13 03:40:48 MDT
I was expecting some comments on the implementation or other features that
might be added :-)

Anyone?
Comment 17 Simon Toth 2010-07-19 12:02:27 MDT
Created an attachment (id=40)
Resources checking logic

Additional logic to schedule resource requests passed through 
  qsub -l resource=value

This patch adds support for -l resource=value requests combined with nodespec.

Example:
  qsub -l nodes=10 -l ncpus=4 -l mem=4G

Two types of resources are supported: per-proc and per-node.

Per-proc resources are counted for each process:
  -l nodes=1:ppn=2:ncpus=3
  If ncpus is per-proc, then this is a request for 2 processes with 3 cpus
each, 6 cpus total.

Per-node resources are counted for each node:
  -l nodes=1:ppn=2:mem=1G
  If mem is per-node, then this is a request for 2 processes with 1G of
memory on the node, 1G of memory total.

Per-proc and per-node are set using a flag in the resc_def_all.c file.
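
A sketch of the accounting difference in C (illustrative only, not the patch;
the enum and function are made up):

  #include <stdio.h>

  enum resc_scope { RESC_PER_PROC, RESC_PER_NODE };

  /* Per-proc: charged once per process (nodes * ppn).
   * Per-node: charged once per node, regardless of ppn. */
  static long total_request(enum resc_scope scope, long nodes, long ppn,
                            long per_unit)
  {
      return scope == RESC_PER_PROC ? nodes * ppn * per_unit
                                    : nodes * per_unit;
  }

  int main(void)
  {
      /* -l nodes=1:ppn=2:ncpus=3, ncpus per-proc -> 6 cpus */
      printf("%ld\n", total_request(RESC_PER_PROC, 1, 2, 3));
      /* -l nodes=1:ppn=2:mem=1G, mem per-node -> 1 (G) */
      printf("%ld\n", total_request(RESC_PER_NODE, 1, 2, 1));
      return 0;
  }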
Comment 18 Simon Toth 2010-07-31 20:45:23 MDT
Created an attachment (id=45)
Total resources support (include resources in nodespec when checking
server/queue limits)

This is the last part of the server logic for now. It adds support for
checking resource limits included both in the nodespec and in resource
requests.

-l nodes=2:ncpus=10

will be checked against the ncpus limit on both the server and the queue
(this is a request for 20 ncpus).
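
A minimal sketch of that comparison, assuming the totals are folded in as
above (made-up function name, not the patch code; resources_max is the usual
server/queue limit attribute):

  #include <stdbool.h>

  /* Sketch: the nodespec total (e.g. nodes=2:ncpus=10 -> 2 * 10 = 20)
   * is compared against a server/queue limit such as resources_max.ncpus. */
  static bool within_limit(long nodes, long per_node_value, long resources_max)
  {
      return nodes * per_node_value <= resources_max;
  }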

It would be great to get some feedback on this whole pack of patches.
Comment 19 Simon Toth 2010-07-31 20:47:53 MDT
Created an attachment (id=46)
Full patch based on top of 2.4 fixes head

Just for convenience. This is the full patch based on top of the 2.4 fixes
head.
Comment 20 Chris Samuel 2010-08-03 23:50:26 MDT
Sorry for the delay in commenting, Simon; I've been flat out bringing up new
systems!

Can you comment on how this interacts with the various schedulers, please?

When a job is submitted and Maui/Moab/pbs_sched is working out where to put
it, will it take these limits into account, or will pbs_server just refuse to
start it if a limit is exceeded?
Comment 21 Simon Toth 2010-08-04 07:44:51 MDT
(In reply to comment #20)
> Sorry for the delay in commenting, Simon; I've been flat out bringing up new
> systems!
> 
> Can you comment on how this interacts with the various schedulers, please?
> 
> When a job is submitted and Maui/Moab/pbs_sched is working out where to put
> it, will it take these limits into account, or will pbs_server just refuse
> to start it if a limit is exceeded?

The whole point of the patch is to make the server a request verification
authority.

There are two checkpoints.

(1)
The submit is now checked not just against the list of resources on the job,
but also against the resources in the nodespec.

qsub -l mem=4G -l nodes=10:ncpus=5

translates into 4G of memory and 50 ncpus and is checked against the server
limits.

The ncpus part wouldn't normally be checked, so this is the first place where
a job can be rejected that previously wouldn't have been.

This shouldn't be a problem, I guess.


(2)
Upon run, the server receives a nodespec from the scheduler. This is the part
most likely to cause incompatibilities. If the request does not contain any
nodespec, the originally submitted one is used; if there is a nodespec, that
nodespec is parsed. The functionality therefore pretty much depends on what
the scheduler keeps in the nodespec when sending a run request.
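
(As a purely hypothetical example, a run request might carry a nodespec such
as node001:ppn=2+node002:ppn=2; the exact contents depend on the scheduler.)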

What this can lead to is that if the scheduler is set to an incompatible mode
(thinking that some resources do not exist, or that they are per-proc instead
of per-node), its run requests can be denied by the server.
Comment 22 Simon Toth 2010-08-04 07:54:45 MDT
> When a job is submitted and Maui/Moab/pbs_sched is working out where to put
> it, will it take these limits into account, or will pbs_server just refuse
> to start it if a limit is exceeded?

The original pbs_sched just sends empty run requests in order of job priority.

I'm not an expert on Maui/Moab, but I would say that they have to deal with
this situation anyway. There are situations (with the current server) where
the job request will be rejected (a node is down, someone ran qrun) or the
job will even run on different nodes than those requested by the scheduler.
Comment 23 Ken Nielson 2010-08-04 10:41:47 MDT
(In reply to comment #17)
> Two types of resources are supported: per-proc and per-node.
> 
> Per-proc resources are counted for each process:
>   -l nodes=1:ppn=2:ncpus=3
>   If ncpus is per-proc, then this is a request for 2 processes with 3 cpus
> each, 6 cpus total.

This is where some of the keyword ambiguity affects things. ppn is processors
per node; that is how TORQUE looks at ppn internally. There is virtually no
difference between ppn and ncpus.
Comment 24 Simon Toth 2010-08-04 10:47:08 MDT
(In reply to comment #23)
> This is where some of the keyword ambiguity affects things. ppn is
> processors per node; that is how TORQUE looks at ppn internally. There is
> virtually no difference between ppn and ncpus.

There is no ncpus in the Torque server. The only thing the server understands
is ppn. As already discussed several times on the mailing list, ppn has the
semantics of neither processors nor processes.
Comment 25 Ken Nielson 2010-08-04 11:14:39 MDT
> There is no ncpus in the Torque server. The only thing the server
> understands is ppn. As already discussed several times on the mailing list,
> ppn has the semantics of neither processors nor processes.

My post stands corrected: ncpus is not in TORQUE. However, ppn does mean
processors per node
(http://www.clusterresources.com/products/torque/docs/2.1jobsubmission.shtml#resources
- see nodes). The -l nodes=1:ppn=2:ncpus=3 creates more ambiguity.

Processes is not a well-defined term for TORQUE. What does it mean to have 2
processes with 3 cpus each? TORQUE views process and processor as the same
thing.
Comment 26 Simon Toth 2010-08-04 15:43:05 MDT
(In reply to comment #25)
> My post stands corrected: ncpus is not in TORQUE. However, ppn does mean
> processors per node
> (http://www.clusterresources.com/products/torque/docs/2.1jobsubmission.shtml#resources
> - see nodes). The -l nodes=1:ppn=2:ncpus=3 creates more ambiguity.
> 
> Processes is not a well-defined term for TORQUE. What does it mean to have 2
> processes with 3 cpus each? TORQUE views process and processor as the same
> thing.

The combination of procs and ncpus is taken from PBS Pro.

Anyway, this patch introduces generic resources. If you don't want ncpus, you
simply don't set it on the nodes.
Comment 27 Simon Toth 2010-08-04 22:37:26 MDT
> The combination of procs and ncpus is taken from PBS Pro.

I just checked the PBS Pro manual (v9.2), and they have already marked ppn as
deprecated and are using just ncpus (in -l select=).

But the old logic works as I described, with ncpus being a per-proc resource.
Comment 28 Simon Toth 2010-09-07 04:29:14 MDT
Any progress?