[torqueusers] building a department GPU cluster

Roberto Nunnari roberto.nunnari at supsi.ch
Tue Jan 22 06:27:01 MST 2013


Hi all.
Thank you very much for your time and help.
Best regards.
Robi


Brock Palen wrote:
>> Roberto Nunnari wrote:
>>> Hi all.
>>>
>>> I'm writing to ask for advice or a pointer in the right direction.
>>>
>>> In our department, more and more researchers ask us (IT administrators)
>>> to assemble (or buy) GPGPU-powered workstations for parallel computing.
>>>
>>> As I already manage a small CPU cluster (resources managed using SGE),
>>> my boss and I talked about building a new GPU cluster. The problem is
>>> that I have no experience at all with GPU clusters.
> 
> I think SGE (OGE) understands GPU allocation; I would check with them. If they have problems, Torque does support GPU allocation; you can see an example in our docs:
> http://cac.engin.umich.edu/resources/software/cuda.html
> 
>>> Apart from the GPU workstations already running, we have some
>>> new HW that looks to me like a promising starting point for
>>> temporarily building and testing a GPU cluster.
>>>
>>> - 1x Dell PowerEdge R720
>>> - 1x Dell PowerEdge C410x
>>> - 1x NVIDIA M2090 PCIe x16
>>> - 1x NVIDIA iPASS Cable Kit
> 
> This is really outside the scope of this list, which is for the Torque resource manager. You probably want to get on a generic HPC list or ping me offline to keep the list on topic.
> 
> I have all that gear and I would call it all last generation.
> 
>>> I'd be grateful if you could kindly give me some advice and/or a
>>> pointer in the right direction.
>>>
>>> In particular I'm interested in your opinion on:
>>> 1) is the above HW suitable for a small (2 to 4/6 GPUs) GPU cluster?
> 
> Yes, though suboptimally. I wouldn't buy M20xx cards; I would only get K20-based cards now that they are available.
> 
>>> 2) is torque suitable (or what should we use?) as a queuing and resource 
>>> management system? We would like the cluster to be usable by many users 
>>> at once in a way that no user has to worry about resources, just like we 
>>> do on the CPU cluster with SGE.
> 
> Torque does work well and supports very large systems (ours is 1200 nodes, 14,000 CPU cores and a few hundred GPUs). We pair it with Moab for scheduling.
> 
>>> 3) Which Linux distribution would be most appropriate?
> 
> Doesn't really matter, but stick with the main ones: RedHat, CentOS, Debian. To use the GPUs you have to install NVIDIA's binary driver; they won't test it on everything, and the developers of <insert obscure linux distro here> won't have the source access to make it work.
> 
>>> 4) necessary stack of sw? (cuda, torque, hadoop?, other?)
> 
> To just have GPUs, only CUDA and the NVIDIA driver are required (read up on nvidia-smi thoroughly).
> 
> Add Torque (or SGE in your case) if you want to use a batch queue system.
> 
> Hadoop doesn't even belong in this discussion currently.
> 
>>> Thank you very much for your valuable insight!
>>>
>>> Best regards.
>>> Robi
>> Anybody on this, please?
>> Robi
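
For anyone following up on the nvidia-smi advice quoted above, here is a minimal sketch of how a node's GPUs could be inventoried from a script, for example before handing them to Torque or SGE. It assumes a driver recent enough that nvidia-smi supports the --query-gpu and --format=csv options. The function names parse_gpu_csv and query_gpus are made up for illustration, and the sample output line is hypothetical, not measured on real hardware.

```python
import csv
import io
import subprocess

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["index", "name", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        index, name, mem_mib = (field.strip() for field in row)
        gpus.append(
            {"index": int(index), "name": name, "memory_mib": int(mem_mib)}
        )
    return gpus


def query_gpus():
    """Run nvidia-smi on the local node and return the parsed GPU list."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = "0, Tesla M2090, 5375\n1, Tesla M2090, 5375\n"
    for gpu in parse_gpu_csv(sample):
        print(gpu)
```

On a node with the driver installed, calling query_gpus() would replace the hard-coded sample. Parsing is kept separate from the subprocess call so it can be exercised on machines without a GPU.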


