[Mauiusers] procs= not working as documented
lance at quantumbioinc.com
Tue Feb 14 11:13:51 MST 2012
(I apologize if you receive this email twice. I'm unsure whether it is a problem in torque, maui, or both and therefore I also posted it to the torque list).
We're still having trouble with this feature, and we are starting to shop around for a torque/maui replacement in order to be able to use it. Before we do that however, I wanted to see if anyone has any thoughts on how to address the problem within torque/maui. Perhaps I simply don't understand the feature. The versions of torque and maui we are using are:
Yes, we have tried newer versions of maui, but then the option doesn't work at all.
Here is the scenario (I also included the conversation from November below for more information).
Conceptually, our software is almost infinitely scalable in the sense that there is very little overhead associated with interprocess communication. Therefore, we do not require that all of the processes reside on a small number of nodes. In fact, we can stretch the processors to any and all nodes in the cluster with ~zero loss in performance. So we can literally have one node that has a single process running and another node that has 8 processes running. Since we have that level of scalability, we don't want to have to lock ourselves into having to request resources using the "nodes=X:ppn=Y" style since this style requires that nodes open up or drain in order to use them. Since our users have a big mixture of single and multi-processor jobs, waiting for node drain can really waste a lot of resources.
I saw the "procs=#" option in the Requesting Resources table (see http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml#resources for more). It *appears* that this option should allow the user to request simply X*Y processors, and the scheduler should be able to place them any way it can fit them. So using the following #PBS directive, we should be able to request 40 processors:
#PBS -l procs=40
Instead, we see that the scheduler seems to take this information, read it, and basically disregard it. The reason I know it reads it is because if I ask for say 40 processors and 40 processors are available in the cluster, it works as expected and all is right with the world. Where it gets a bit more choppy is when I ask for 40 processors and only 1 processor is available. The job doesn't wait in the queue for the remaining 39 processors to open up, and instead PBS simply just starts the job on that processor. I can't see how that is anything but a bug. If the user is asking for 40 processors, why isn't the scheduler waiting for all 40 processors to open up?
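As a stopgap for the behavior described above, a job script can at least refuse to run when it was started with fewer processors than it asked for. The following is only a sketch, not a fix for the scheduler: it assumes that Torque's $PBS_NODEFILE lists one line per processor slot actually granted to the job, and the names REQUESTED_PROCS, count_slots and have_enough_slots are hypothetical helpers introduced here for illustration.

```shell
#!/bin/bash
#PBS -l procs=40

# Hypothetical value; keep it in sync with the procs= request above.
REQUESTED_PROCS=40

# Count the processor slots listed in a node file (one non-empty line per slot).
count_slots() {
    grep -c . "$1"
}

# Succeed only if the node file lists at least the requested number of slots.
have_enough_slots() {
    local requested=$1 nodefile=$2
    local granted
    granted=$(count_slots "$nodefile")
    [ "$granted" -ge "$requested" ]
}

# Only run the check inside a job, where Torque sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ]; then
    if ! have_enough_slots "$REQUESTED_PROCS" "$PBS_NODEFILE"; then
        echo "Granted $(count_slots "$PBS_NODEFILE") of $REQUESTED_PROCS processors; exiting." >&2
        exit 1
    fi
fi
```

With a guard like this, the job still leaves the queue too early, but it exits immediately instead of running for its full walltime on a fraction of the processors it requested. The user would have to resubmit it.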
If answering this question requires additional information, please ask. We are at our wits' end here.
On Nov 18, 2011, at 9:39 AM, Lance Westerhoff wrote:
> Hello All-
> I submitted the following to the torque list, but the more I look at it, the more I think it might be a scheduler problem. It appears that when running with the following specs, the procs= option does not actually work as expected.
> #PBS -S /bin/bash
> #PBS -l procs=60
> #PBS -l pmem=700mb
> #PBS -l walltime=744:00:00
> #PBS -j oe
> #PBS -q batch
> torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented
> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU)
> If there are fewer than 60 processors available in the cluster (in this case there were 53 available), the job will go in and take whatever processors are remaining instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60; however, if only 2 processors are available and the user asks for 60, I would hate for the job to go in.
> Thank you for your time!