How To Build and Use a Multi GPU System for Deep Learning

When I started using GPUs for deep learning my deep learning skills improved quickly. When you can run experiments of algorithms and algorithms with different parameters and gain rapid feedback you can just learn much more quickly. At the beginning, deep learning is a lot of trial and error: You have to get a feel what parameters need to be adjusted, or what puzzle piece is missing in order to get a good result. A GPU helps you to fail quickly and learn important lessons so that you can keep improving. Soon my deep learning skills were sufficient to take the 2nd place in the Crowdflower competition where the task was to predict weather labels from given tweets (sunny, raining etc.). After this success I was tempted to use multiple GPUs in order to train deep learning algorithms even faster. I also took interest in learning very large models which do not fit into a single GPU. I thus wanted to build a little GPU cluster and explore the possibilities to speed up deep learning with multiple nodes with multiple GPUs. At the same time I was offered to do contract work as a data base developer through my old employer. This gave me opportunity to get the money to build the GPU cluster I thought of. Important components in a GPU cluster When I did my research on which hardware to buy I soon realized, that the main bottleneck will be the network bandwidth, i.e. how much data can be transferred from computer to computer per second. The network bandwidth of network cards (affordable cards are at about 4GB/s) does not come even close to the speed of PCIe 3.0 bandwidth (15.75 GB/s). So GPU-to-GPU communication within a computer will be fast, but it will be slow between computers. On top of that most network card only work with memory that is registered with the CPU and so the GPU to GPU transfer between two nodes would be like this: GPU 1 to CPU 1 to Network Card 1 to Network Card 2 to CPU 2 to GPU 2. What this means is, if one chooses a slow network card then there might be no speedups over a single computer. Even with fast network cards, if the cluster is large, one does not even get speedups from GPUs when compared to CPUs as the GPUs just work too fast for the network cards to keep up with them. This is the reason why many big companies like Google and Microsoft are using CPU rather than GPU clusters to train their big neural networks. Luckily, Mellanox and Nvidia recently came together to work on that problem and the result is GPUDirect RDMA, a network card driver that can make sense of GPU memory addresses and thus can transfer data directly from GPU to GPU between computers.

GPU pic - Copy
NVIDIA GPUDirect RDMA can bypass the CPU for inter-node communication – data is directly transfered between two GPUs.

Generally your best bet for cheap network cards is eBay. I won an auction for a set of 40Gbit/s Mellanox network cards that support GPUDirect RDMA along with the fitting fibre cable on eBay. I already had two GTX Titan GPUs with 6GB of memory and as I wanted to build huge models that do not fit into a single memory, so I decided to keep the 6GB cards and buy more of them to build a cluster that features 24GB memory. In retrospect this was a rather foolish (and expensive) idea, but little did I know about the performance of such large models and how to evaluate the performance of GPUs. All the lessons I learned from this can be found here. Besides that the hardware is rather straightforward. For fast inter-node communication PCIe 3.0 is faster than PCIe 2.0, so I got a PCIe 3.0 board. It is also a good idea to have about two times the RAM than you have GPU memory to be able to work more freely to handle big nets. As deep learning programs use a single thread for a GPU most of the time, a CPU with as many cores as GPUs you have is often sufficient. Hardware: Check. Software: ? There are basically two options how to do multi-GPU programming. You do it in CUDA and have a single thread and manage the GPUs directly by setting the current device and by declaring and assigning a dedicated memory-stream to each GPU, or the other options is to use CUDA-aware MPI where a single thread is spawned for each GPU and all communication and synchronization is handled by MPI. The first method is rather complicated as you need to create efficient abstractions where you loop through the GPUs and handle streaming and computing. Even with efficient abstractions your code can blow up quickly in line count making it less readable and maintainable.

GPU pic - Copy
Some sample MPI code. The first action spreads one chuck of data to all others computer in the network; the second action receives one chuck of data from every process. That is all you need to do, it is very easy!

The second option is much more efficient and clean. MPI is the standard in high performance computing and its standardized library means that you can be sure that a MPI method really does what it is supposed to do. Underlying MPI the same principles are used as in the first method describes above, but the abstraction is so good that it is quite easy to adapt single GPU code to multiple GPU code (at least for data parallelism). The result is clean and maintainable code and as such I would always recommend using MPI for multi-GPU computing.  As MPI libraries come in many languages and you can pair them with the language of your choice. With these two components you are ready to go and can immediately start programming deep learning algorithms for multiple GPUs. [Image source:  NVIDIA GPUDirect Key Technologies]

How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism

In my last blog post I explained what model and data parallelism is and analysed how to use data parallelism effectively in deep learning. In this blog post I will focus on model parallelism.

To recap, model parallelism is, when you split the model among GPUs and use the same data for each model; so each GPU works on a part of the model rather than a part of the data. In deep learning, one approach is to do this by splitting the weights, e.g. a 1000×1000 weight matrix would be split into a 1000×250 matrix if you use four GPUs.

Model parallelism diagram. Synchronizing communication is needed after each dot product with the weight matrix for both forward and backward pass.

One advantage of this approach is immediately apparent: If we split the weights among the GPUs we can have very large neural networks which weights would not fit into the memory of a single GPU. In part I mentioned this in an earlier blog post, where I also said that such large neural networks are largely unnecessary. However, for very big unsupervised learning tasks – which will become quite important in the near future – such large networks will be needed in order to learn fine grained features that could learn “intelligent” behavior.

How does a forward and backward pass work with such split matrices? This is most obvious when we do the matrix algebra step by step:

We start looking at {\bf{A}\,\bf{B} = \bf{C}}  which would be the dot matrix multiply for the usual forward pass case. The dimensions for using model parallelism with two GPUs for a batch size of 128 and a 1000×500 weight matrix would be:

Standard: 128×1000 dot 1000×500 = 128×500

Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> add matrices

Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -> stack matrices

To calculate the errors in the layer below we need to pass the current error through to the next layer, or more mathematically, we calculate the deltas by taking the dot product of the error of the previous layer {i} and the weights that connect to the next layer {j} , i.e. {\bf{\delta_j}\,\bf{W^T} = \bf{\delta_i}}:

Standard: 128×500 dot 500×1000 = 128×1000

Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> stack matrices

Split by weight matrix second dimension: 128×250 dot 250×1000 = 128×1000 -> add matrices

We see here, we need to synchronize (adding or stacking weights) after each dot product and you may think that this is slow when compared to data parallelism, where we synchronize only once. But one can quickly see that this is not so for most cases if we do the math: In data parallelism a 1000×500 gradient needs to be transferred once for the 1000×500 layer – that’s 500000 elements; for model parallelism we just need to transfer a small matrix for each forward and backward pass with a total of 128000 or 160000 elements – that’s nearly 4 times less data! So the network card bandwidth is still the main bottleneck in the whole application, but much less so than in the data parallelism case.

This is of course all relative and depends on the network architecture. Data parallelism will be quite fast for small networks and very slow for large networks, the opposite is true for model parallelism. The more parameters we have, the more beneficial is model parallelism. Its true strength comes to play if you have neural networks where the weights do not fit into a single GPU memory. Here model parallelism might achieve that for which one would need thousands of CPUs.

However, if you run small networks where the GPUs are not saturated and have some free capacity (not all cores are running), then model parallelism will be slow. Unlike data parallelism, there are no tricks you can use to hide the communication needed for synchronization, this is because we have only partial information for the whole batch. With this partial information we cannot compute the activities in the next layer and thus have to wait for the completion of the synchronization to move forward.

How the advantages and disadvantages can be combined is best shown by Alex Krizhevsky who demonstrates the efficiency of using data parallelism in the convolutional layers and model parallelism in the dense layers of a convolutional neural network.

How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism

In my last blog post I showed what to look out for when you build a GPU cluster. Most importantly, you want a fast network connection between your servers and using MPI in your programming will make things much easier than to use the options available in CUDA itself.

In this blog post I explain how to utilize such a cluster to parallelize neural networks in different ways and what the advantages and downfalls are for such algorithms. The two different algorithms are data and model parallelism. In this blog entry I will focus on data parallelism.

So what are these two? Data parallelism is when you use the same model for every thread, but feed it with different parts of the data; model parallelism is when you use the same data for every thread, but split the model among threads.

For neural networks this means that data parallelism uses the same weights and but different mini-batches in each thread; the gradients need to be synchronized, i.e. averaged, after each pass through a mini-batch.

Model parallelism splits the weights of the net equally among the threads and all threads work on a single mini-batch; here the generated output after each layer needs to be synchronized, i.e. stacked, to provide the input to the next layer.

Each method has its advantages and disadvantages which change from architecture to architecture. Let us look at data parallelism first and its bottlenecks first and in the next post I will look at model parallelism.

Severity of the network bottleneck of data parallelism

The idea of data parallelism is simple. If you have, say, 4 GPUs you split a mini-batch into parts for each of them, say, you split a mini-batch with 128 examples into 32 examples for each GPU. Then you feed the respective batch through the net and obtain gradients for each split of the mini-batch. You then use MPI to collect all the gradients and update the parameters with the overall average.

Featured image
Data parallelism diagram. There is no communication in the forward pass, and during the backward pass you synchronize gradients.

The biggest problem with this approach is that during the backward pass you have to pass the whole gradient to the all other GPUs. If you have a 1000×1000 weight matrix then you need to pass 4000000 bytes to each network. If we take a 40Gbit/s network card – which is already quite fast – then you will need {\frac{4000000}{40}\frac{1}{40\times 1024\times 1024 \times 102}\frac{1}{8\times 1000} = 0.75\mbox{ms}}  to pass the data from one node to another node (however, there is some additional overhead that is neglected here). If you have six GPUs in two nodes you need to pass the data to five other GPUs, three of which need to go through the network card (3x 0.75ms), while two can use PCIe 3.0 to pass the data to the other two GPUs (about three times as fast; 2x 0.25ms). However, the PCIe pass is independent of the network card pass, so the time needed is determined by the network card time alone, i.e. 2.25ms. However, only one GPU can transfer data through the network card at any one time in any one node, so that we have to multiply that time by three, i.e. 7.75ms. Now the bottom line is, that we just need about 0.2ms for a matrix multiply through that layer (100×1000 dot 1000×1000) and about twice as much for the backward pass. We can pass the gradient while we work on the next layer, but in the end the network card speed limits our overall computation by quite a bit. This is more marked the larger you scale your system: A four node system working on the same problem needs about 20.25ms to pass the gradients around to the other GPUs. One can easily see that data parallelism does not scale with size of the cluster.

To counter this bottleneck is to reduce the parameters of the gradient through max pooling, maxout units or by simply using convolution. Another way is to increase the computational time/network time ratio by other means, e.g. by using is computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spend on computation, thus increasing the utility of the fast GPUs.

Another thing you can do when you use computationally intensive optimization techniques is to hide latency of networking under the computation of the gradients. This means while you passing the first gradient to all other nodes, you can already start a big RMSProp computation asynchronously for the next layer. This technique can give a speedup of about 0-20 % depending on network architecture.

But this is not the only problem with data parallelism. There is a very technical bottleneck hidden in the GPU architecture which took me quite a while to understand. To understand why the GPU architecture is a problem we first need to look at the usage and purpose of mini-batches.

A divergence: Why do we use mini-batches?

If we start with randomly initialized parameters or even if we start with pretrained parameters, we do not need a pass through all the data to get an accurate gradient update that will head into the direction of a local minimum. If we take MNIST as an example, if we have a gradient which includes 10 common mistakes that the network does for each class (mini-batch size of about 128), then we will go into a direction that reduces the error greatly already as the gradient captures rough and common mistakes. If we choose a greater batch size (say 512) then we not only capture common errors, but also catch errors that are more subtle. However, it is not very sensible to fine-tune a system if you know it still has major errors. So overall we gain little by increasing the batch size. We need more computation to do roughly the same and this is the main argument why we use a mini-batch size as small as possible. However, if we choose a mini-batch size that is too small, then we do not capture all the common errors which are relevant for the data set and thus our gradient might not head near a local optimum, so there is a limit how small you can make mini-batches.

How does this relate to data parallelism? If we want a mini-batch size of 128 and use data parallelism to divide it among, say, eight GPUs, then each net calculates gradients for 16 samples which is then averages with the data from the other GPUs. And exactly here kicks the hardware bottleneck in.

Memory tiles: Patches of fast GPU memory for efficient dot product calculations

To calculate dot products on the GPU, you need to copy small patches, called memory tiles, into shared memory, i.e. very fast but very small memory (limited to a few kilobytes). The problem is that the standard cuBLAS uses either a 64×128 memory tiles and when you have a batch size less than 64 you waste a lot of precious shared memory. Also if you use a batch size not equal to a multiple of 32 you equally waste shared memory (threads are only started in blocks of 32 threads), so one should use a batch size which is a multiple of 32 or multiple of 64 if possible. For data parallelism this means that you lose significant processing speed once you go below a batch size of 64 for each GPU. If you have many GPUs this can be quite limiting and this is yet another reason why the data parallelism approach does not scale well beyond a certain point.

All in all this sounds quite dire for data parallelism, but data parallelism has its uses. If you know the bottlenecks, you can wield data parallelism as a might tool for certain applications. This is demonstrated by Alex Krishevsky in his paper where he uses data parallelism in the convolutional layers of his net, and thus achieves a speedup of 3.74x by using four GPUs and 6.25x using eight GPUs. His system features two CPUs and 8 GPUs in one node, so he can use the full PCIe speed for the two sets of four GPUs and relatively fast PCIe connection between CPUs to distribute the data among all eight GPUs.

Besides convolutional neural networks, another use of data parallelism might be to use it in recurrent neural networks, which typically have less parameters and highly computationally intensive gradient updates – both are wins for data parallelism.

In my next blog post I will focus on model parallelism, which is efficient for large networks and scales well to larger clusters.

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

Why should I get a GPU?

I have been using GPUs for nearly two years now and for me it is again and again amazing to see how much speedup you get. Compared to CPUs 20x speedups are typical, but on larger problems one can achieve 50x speedups. With GPUs you can try out new ideas, algorithms and experiments much faster than usual and get almost immediate feedback as to what works and what does not. This is a very important aspect when one begins to do deep learning as this rapid gain in practical experience is key to build the expertise with which you can make deep learning work on new problems. Without this rapid feedback it just takes too much time to learn from one’s mistakes and it can be discouraging and frustrating to go on with deep learning.

With GPUs I quickly learned how to apply deep learning on a range of Kaggle competitions and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where it was the task to predict weather ratings for a given tweet. In the competition I used a rather large two layered deep neural network with rectified linear units and dropout for regularization and this deep net fitted barely into my 6GB GPU memory. More details on my approach can be found here.

Should I get multiple GPUs?

Excited by what deep learning can do with GPUs I plunged myself into multi-GPU territory by assembling a small GPU cluster with InfiniBand 40Gbit/s interconnect. I was thrilled to see if even better results can be obtained with multiple GPUs.

GPU pic - Copy
Setup in my main computer: You can see three GXT Titan and an InfiniBand card. Is this a good setup for doing deep learning?

I quickly found that it is not only very difficult to parallelize neural networks on multiple GPU efficiently, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks like I used in the Partly Sunny with a Chance of Hashtags Kaggle competition received almost no speedup.

However, using model parallelism, I was able to train neural networks that were much larger and that had almost 3 billion connections. But to leverage these connections one needs just much larger data sets than are normally used. I found some uses for that when I trained a language model on the entire Wikipedia corpus – but that’s about it.

On the other hand, one advantage of multiple GPUs is that you can run multiple algorithms or experiments separately on each GPU. This is highly useful if your main goal is to gain deep learning experience as quickly as possible. You gain no speedups, but you get more information of your performance by using different algorithms or parameters at once.

If you use deep learning only occasionally, or you use rather small data sets (smaller than say 10-15GB) and foremost dense neural networks, then multiple GPUs are probably not for you. However, when you use convolutional neural networks a lot, then multiple GPUs might still make sense.

Alex Krizhevsky released his new updated kernels which can run convolutional neural networks on up to four GPUs. Convolutional neural networks – unlike dense neural networks – can be run very efficiently on multiple GPUs because their use of weight sharing makes data parallelism very efficient. On top of that, Alex Krizhevsky’s implementation utilizes model parallelizm for the densely connected final layers of the network.

However, if you want to program similar networks for yourself, be aware that to program efficient convolutional kernels for multiple GPUs is a very difficult undertaking for which expert GPU programming skills are required.

So overall, one can say that one GPU should be sufficient for almost any task and that additional GPUs convey only benefits under very specific circumstances.

So what kind of GPU should I get?

Required memory size

People often ask me if the GPUs with the largest memory are best for them, as this would enable them to run the largest neural networks. I thought like this when I bought my GPU, a GTX Titan with 6GB memory. And I also thought that this was a good choice when my neural network in the Partly Sunny with a Chance of Hashtags Kaggle competition barely fitted in my GPU memory. But later I found out that my neural network implementation was very memory inefficient and much less memory would have been sufficient.

Generally, if you want to know how large a neural network you could fit into a given GPU memory then subtract 400 MB as a buffer, then divide the GPU memory by two for the momentum matrix, then multiply the memory in MB by {1024\times 1024} to get bytes, and finally divide by four get the number of floating point numbers fitting into that memory. For a GTX Titan this would be \frac{(6144-400)\times 1024\times 1024}{4\times 2} = 752.877.568 parameters. For the Kaggle competition I used a 9000x4000x4000x32 network, which are just 52.128.000 parameters (with momentum matrix). So this network would even fit into a small GPU with 1536 MB (space for about 148.897.792 parameters). For comparison, Alex Krizhevsky’s convolutional neural network for the ImageNet competition featured 60.000.000 parameters (120 million with momentum matrix).

So the bottom line is if you just want to use neural networks on some Kaggle data sets and build normal models, you should not worry too much about memory ({\geq 1536\mbox{MB}} is fine). For cutting edge models that are used on very large amounts of data you should have more than 3GB.

Update 2014-09-28: Sander Dieleman made me aware, that this last sentence should be emphasized: If you really want to work on large data sets, you should go for either 4GB or 6GB (depending on your budget). He also pointed out an error in my reasoning with parameters: Convolutional kernels need additional memory to run, i.e. they need additional memory beyond the parameters alone. So Alex’s net – even though it features 120 million parameters, does not fit into 3GB memory. So my calculation above are off for convolutional neural networks (which are arguably the most important neural networks to date). Otherwise this blog post contains all the information you need to make a fully informed choice for your deep learning GPU. I will update the post as more information regard GTX 980 performance is known (first currency mining tests indicate, that the GTX 980 will be the best GPU. Although it’s stats are bad, it’s architecture seems to make more than just up for it! More later).

Fastest GPU for a given budget

Processing performance is most often measured in floating-point operations per second (FLOPS). This measure is often advertised in GPU computing and it is also the measure which determines which supercomputer enters the TOP500 list of the fastest supercomputers. However, this measure is misleading, as it measures processing power on problems that do not occur in practice.

It turns out that the most important practical measure for GPU performance is bandwidth in GB/s, which measures how much memory can be read and written per second. This is because almost all mathematical operations, such as dot product, sum, addition etcetera, are bandwidth bound, i.e. limited by the GB/s  of the card rather than its FLOPS.

memory-bandwidth
Comparison of bandwidth for CPUs and GPUs over time. Bandwidth is one of
the main reasons why GPUs are faster for computing than CPUs are.

To determine the fastest GPU for a given budget one can use this Wikipedia page and look at Bandwidth in GB/s; the listed prices are quite accurate for newer cards (600 and 700 series), but the 400 and 500 series is significantly cheaper than the listed prices – especially if you buy those cards via eBay.

Another important factor to consider however, is that the Fermi architecture (400 and 500 series) is quite a bit faster than the newer Kepler architecture (600 and 700 series), so that for example the GTX 580 is faster than any GTX 600 series GPU. Only high end 700 series GPUs (770, 780, Titan) outpace the GTX 580.

So the only disadvantage a GTX 580 has is its smaller memory of 1.5 or 3GB – which as discussed above – is not so bad after all. With an eBay price of $150-200 the GTX 580 is rather cheap (compared to the 770, 780 and Titan, that is) while delivering excellent performance with a good amount of memory.

For special applications one can easily reason which GPU to choose by balancing ones required memory size, bandwidth in GB/s and the price of the GPU, and this reasoning will be solid for many years to come. But right now and more generally, choosing a GTX 580 will be the best and most cost-effective choice for most deep learning application that does not involve (very) large neural networks.

 

[Image source: NVIDIA CUDA/C Programming Guide]