[torqueusers] specific nodes

Gustavo Correa gus at ldeo.columbia.edu
Wed Nov 30 17:21:33 MST 2011


Answers inline
On Nov 30, 2011, at 5:38 PM, Ricardo Román Brenes wrote:

> Thank you so much for your help =) but I still have a few matters to discuss.
> 
> 
> On Wed, Nov 30, 2011 at 4:22 PM, Gustavo Correa <gus at ldeo.columbia.edu> wrote:
> You don't have 8 CPUs of type 'uno'.
> This seems to conflict with your mpirun command with -np=8.
> You need to match the number of processors you request from Torque and
> the number of processes you launch with mpirun.
> 
> 
> 
> 1. Why does there have to be a match between processors and processes? I could run 1024 processes on 1 processor (without Torque). Requesting 2 nodes, I could spawn 10000 processes...
> 

You can oversubscribe the processors with MPI tasks if you want.
The MPI distributions brag that you can do it, and in many cases it works all right.

In general, if your MPI tasks are of the 'hello world' type, oversubscribing is not a problem;
you can run thousands of processes on a handful of CPUs.
However, if you are doing real HPC, it is another story.
Frequent context switching doesn't get along very well with MPI,
and paging to disk will most likely be a killer.

As far as I know, Torque is designed *not* to oversubscribe CPUs.
Resource managers like Torque are designed primarily for HPC (though not only),
so they tend to carry the underlying assumption of "one processor for each process".


If you want to oversubscribe with Torque, trick it by setting a larger number of processors
in the $Torque/server_priv/nodes file [e.g. np=10000 instead of np=2].
It will probably run 'hello world' just fine.
Then try heavier algorithms and deal with the consequences ... :)
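
For reference, a line in server_priv/nodes typically looks like this
[the host name 'node01' and the property 'uno' are just placeholders for your setup]:

node01 np=2 uno

and to oversubscribe you would raise np, e.g.:

node01 np=16 uno

then restart pbs_server so it rereads the nodes file.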


>  
> Also, you wrote:
> 
> #PPS -q uno
> 
> Is this a typo in your email or in your Torque submission script?
> It should be:
> 
> #PBS -q uno
> 
> In addition, your PBS script doesn't request nodes, something like
> #PBS -l nodes=1:ppn=2
> I suppose it will use the default for the queue uno.
> However, your qmgr configuration doesn't set a default number of nodes to use,
> either for the queues or for the server itself.
> 
> You could do:
> qmgr -c 'set queue uno resources_default.nodes = 1'
> and likewise for queue dos.
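> That is, something along the lines of:
> 
> qmgr -c 'set queue dos resources_default.nodes = 1'
> qmgr -c 'set server resources_default.nodes = 1'
> 
> [the server-level default is optional, just a fallback for queues that don't set their own].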
> 
> 
> 
> 2. That's in fact a typo. In the script it says #PBS.
> 
> 
>  
> More important, is your mpi [and mpiexec] built with Torque support?
> For instance, OpenMPI can be built with Torque support, so that it
> will use the nodes provided by Torque to run the job.
> However, stock packaged MPIs from yum or apt-get are probably not
> integrated with Torque.
> You would need to build it from source, which is not really hard.
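> 
> [As a rough sketch of that route, in case you go with Open MPI: its configure script takes a
> --with-tm flag pointing at the Torque installation, something like
> ./configure --prefix=/opt/openmpi --with-tm=/usr/local && make && make install
> where both paths are just placeholders for your system.]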
> 
> If you use an mpi that is not integrated with Torque, you need to pass to mpirun/mpiexec
> the file created by Torque with the node list.
> The file name is held by the environment variable $PBS_NODEFILE.
> The syntax varies depending on which MPI you are using [check your mpirun man page],
> but it should be something like:
> 
> mpirun -hostfile $PBS_NODEFILE -np 2  ./a.out
> 
> 
> 3. My MPICH2 is version 1.2.1p1. I don't recall if I compiled it with Torque support. Even so, I don't have a variable $PBS_NODEFILE (doing an "echo $PBS_NODEFILE" returns an empty line).
> 

It is a Torque variable. You will have it inside the Torque submission script only,
not in your Linux shell per se.
Try "echo $PBS_NODEFILE" and "cat $PBS_NODEFILE" inside the Torque script.

If you don't have Torque support in your MPICH2, you definitely
need to pass -machinefile (or -hostfile) $PBS_NODEFILE to mpiexec.
In fact, if you already set a default machinefile/hostfile in your MPICH2 directories,
you may be using that file inadvertently, instead of the nodes that Torque gives to your job.
Did you set a default machine file in MPICH2?
Does it contain all of your cluster's nodes?
This may explain why your job executes on nodes you didn't expect it to.
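
If you do need to pass the node list by hand, the launch line inside the Torque script
would be something along these lines [the flag names follow MPICH2's mpiexec; check your
own man page]:

NP=$(wc -l < $PBS_NODEFILE)
mpiexec -machinefile $PBS_NODEFILE -n $NP ./a.out

Counting the lines of $PBS_NODEFILE keeps the number of MPI processes equal to the number
of processors that Torque allocated.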

> 
> 4. I don't know if this is my problem or not, but you talk about mpirun and mpiexec as if they were the same, yet I have used mpiexec most of the time and I'm not sure about the similarities (or differences). You asked if my MPIEXEC is built with Torque, but a few lines below you mention MPIRUN.

The traditional name is mpirun; most MPIs changed to mpiexec, and many have both,
sometimes just as a soft link or alias to each other.
Check what you have.
Be careful if you installed several different MPIs: make sure you know exactly which
one you are using to compile [mpicc, mpif90] and use the same one to run [mpirun/mpiexec].
They *don't* mix well.
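
A few quick checks can sort this out [exact flags vary across MPIs; these are the common ones]:

which mpicc mpiexec mpirun
mpicc -show        # MPICH-style wrappers print the underlying compile line; Open MPI uses -showme
mpiexec --version  # or mpirun --version, depending on the MPI

If those paths point into different MPI installations, that is very likely part of the problem.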

Gus Correa

>  
> [ The flag may be -machinefile instead of -hostfile, or something else, depending on your MPI.]
> 
> 
> On Nov 30, 2011, at 4:11 PM, Ricardo Román Brenes wrote:
> 
> > I'll post some more info since I'm pretty desperate right now :P
> >
> 
> Oh, yes.
> You should always do this, if you want help from the list.
> Do you see how much more help you get when you give all the information?  :)
> 
> 
> I hope this helps,
> Gus Correa
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


