Why should I get a GPU?
I have been using GPUs for nearly two years now, and it amazes me again and again how much of a speedup you get. Compared to CPUs, 20x speedups are typical, and on larger problems one can achieve 50x speedups. With GPUs you can try out new ideas, algorithms, and experiments much faster than usual and get almost immediate feedback about what works and what does not. This is a very important aspect when you begin deep learning, because this rapid gain in practical experience is key to building the expertise with which you can make deep learning work on new problems. Without this rapid feedback it simply takes too long to learn from one’s mistakes, and it can be discouraging and frustrating to carry on with deep learning.
With GPUs I quickly learned how to apply deep learning to a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where the task was to predict weather ratings for a given tweet. In the competition I used a rather large two-layer deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory. More details on my approach can be found here.
Should I get multiple GPUs?
Excited by what deep learning can do with GPUs, I plunged into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.

I quickly found that it is not only very difficult to parallelize neural networks over multiple GPUs efficiently, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition received almost no speedup.
However, using model parallelism, I was able to train neural networks that were much larger, with almost 3 billion connections. But to leverage these connections one needs much larger data sets than are normally used. I found some uses for that when I trained a language model on the entire Wikipedia corpus – but that’s about it.
On the other hand, one advantage of multiple GPUs is that you can run multiple algorithms or experiments separately, one on each GPU. This is highly useful if your main goal is to gain deep learning experience as quickly as possible. You gain no speedups, but you get more information about your performance by trying different algorithms or parameters at once.
If you use deep learning only occasionally, or you use rather small data sets (say, smaller than 10-15GB) and mostly dense neural networks, then multiple GPUs are probably not for you. However, if you use convolutional neural networks a lot, then multiple GPUs might still make sense.
Alex Krizhevsky released his new updated kernels, which can run convolutional neural networks on up to four GPUs. Convolutional neural networks – unlike dense neural networks – can be run very efficiently on multiple GPUs because their use of weight sharing makes data parallelism very efficient. On top of that, Alex Krizhevsky’s implementation utilizes model parallelism for the densely connected final layers of the network.
However, if you want to program similar networks yourself, be aware that programming efficient convolutional kernels for multiple GPUs is a very difficult undertaking that requires expert GPU programming skills.
So overall, one can say that a single GPU should be sufficient for almost any task and that additional GPUs provide benefits only under very specific circumstances.
So what kind of GPU should I get?
Required memory size
People often ask me whether the GPUs with the largest memory are best for them, as this would enable them to run the largest neural networks. I thought like this when I bought my GPU, a GTX Titan with 6GB of memory. And I still thought this was a good choice when my neural network in the Partly Sunny with a Chance of Hashtags Kaggle competition barely fit into my GPU memory. But later I found out that my neural network implementation was very memory inefficient and that much less memory would have been sufficient.
Generally, if you want to know how large a neural network you can fit into a given GPU memory, subtract 400 MB as a buffer, then divide the remaining memory by two for the momentum matrix, then multiply the memory in MB by 1024² (1,048,576) to get bytes, and finally divide by four to get the number of 32-bit floating point numbers that fit into that memory. For a GTX Titan this would be (6144 − 400)/2 × 1024² / 4 = 752,877,568 parameters. For the Kaggle competition I used a 9000x4000x4000x32 network, which is just 52,128,000 parameters. So this network would even fit into a small GPU with 1536 MB (space for about 148,897,792 parameters). For comparison, Alex Krizhevsky’s convolutional neural network for the ImageNet competition featured 60,000,000 parameters (120 million with momentum matrix).
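The estimate above can be sketched in a few lines of Python. Note that it applies to dense nets only; as the update below points out, convolutional layers need additional working memory beyond their parameters:

```python
def params_that_fit(memory_mb, buffer_mb=400):
    """Rough number of float32 parameters that fit into a given GPU memory:
    subtract a buffer, halve for the momentum matrix, convert MB to bytes,
    divide by 4 bytes per float32. Dense networks only."""
    usable_bytes = (memory_mb - buffer_mb) / 2 * 1024**2
    return int(usable_bytes / 4)

def dense_params(layers):
    """Weight count of a fully connected net, e.g. [9000, 4000, 4000, 32]."""
    return sum(a * b for a, b in zip(layers, layers[1:]))

print(params_that_fit(6144))                  # GTX Titan: 752,877,568
print(params_that_fit(1536))                  # 1536 MB card: 148,897,792
print(dense_params([9000, 4000, 4000, 32]))   # Kaggle net: 52,128,000
```
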
So the bottom line is: if you just want to use neural networks on some Kaggle data sets and build normal models, you should not worry too much about memory (3GB is fine). For cutting-edge models that are used on very large amounts of data you should have more than 3GB.
Update 2014-09-28: Sander Dieleman made me aware that this last sentence should be emphasized: If you really want to work on large data sets, you should go for either 4GB or 6GB (depending on your budget). He also pointed out an error in my reasoning about parameters: Convolutional kernels need additional memory to run, i.e. they need memory beyond the parameters alone. So Alex’s net – even though it features 120 million parameters – does not fit into 3GB of memory, and my calculations above are off for convolutional neural networks (which are arguably the most important neural networks to date). Otherwise this blog post contains all the information you need to make a fully informed choice for your deep learning GPU. I will update the post as more information regarding GTX 980 performance becomes known (first currency mining tests indicate that the GTX 980 will be the best GPU; although its stats look bad, its architecture seems to more than make up for it. More later).
Fastest GPU for a given budget
Processing performance is most often measured in floating-point operations per second (FLOPS). This measure is often advertised in GPU computing and it is also the measure which determines which supercomputer enters the TOP500 list of the fastest supercomputers. However, this measure is misleading, as it measures processing power on problems that do not occur in practice.
It turns out that the most important practical measure of GPU performance is memory bandwidth in GB/s, which measures how much memory can be read and written per second. This is because almost all mathematical operations, such as dot products, sums, and additions, are bandwidth bound, i.e. limited by the GB/s of the card rather than by its FLOPS.

This high memory bandwidth is also one of the main reasons why GPUs are faster for computing than CPUs.
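A back-of-the-envelope sketch makes the bandwidth-bound argument concrete. The GTX Titan figures below (roughly 288 GB/s bandwidth, roughly 4.5 TFLOPS single precision) are the card’s published specs; the operation is a simple elementwise vector addition:

```python
N = 100_000_000            # vector length
bytes_moved = 3 * N * 4    # read a, read b, write a+b (4 bytes per float32)
flops = N                  # one addition per element

bandwidth = 288e9          # bytes/s (GTX Titan memory bandwidth)
peak_flops = 4.5e12        # single-precision FLOPS (GTX Titan)

time_memory = bytes_moved / bandwidth   # time if limited by memory traffic
time_compute = flops / peak_flops       # time if limited by arithmetic
print(f"memory-limited time:  {time_memory * 1e3:.2f} ms")
print(f"compute-limited time: {time_compute * 1e3:.3f} ms")
```

Memory traffic dominates by roughly two orders of magnitude, so for operations like this the card’s GB/s, not its FLOPS, predicts performance.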
To determine the fastest GPU for a given budget one can use this Wikipedia page and look at the bandwidth in GB/s; the listed prices are quite accurate for newer cards (600 and 700 series), but the 400 and 500 series are significantly cheaper than the listed prices – especially if you buy those cards via eBay.
Another important factor to consider, however, is that the Fermi architecture (400 and 500 series) is quite a bit faster than the newer Kepler architecture (600 and 700 series), so that, for example, the GTX 580 is faster than any GTX 600 series GPU. Only the high-end 700 series GPUs (770, 780, Titan) outpace the GTX 580.
So the only disadvantage a GTX 580 has is its smaller memory of 1.5 or 3GB – which as discussed above – is not so bad after all. With an eBay price of $150-200 the GTX 580 is rather cheap (compared to the 770, 780 and Titan, that is) while delivering excellent performance with a good amount of memory.
For special applications one can easily reason about which GPU to choose by balancing the required memory size, the bandwidth in GB/s, and the price of the GPU, and this reasoning will stay sound for many years to come. But right now, and more generally, a GTX 580 is the best and most cost-effective choice for most deep learning applications that do not involve (very) large neural networks.
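This bandwidth-per-dollar reasoning can be sketched as a small comparison script. The bandwidth figures are the cards’ published specs; the prices are rough eBay/retail estimates from around the time of writing and will certainly vary:

```python
# name: (memory bandwidth in GB/s, approximate price in USD)
cards = {
    "GTX 580":   (192, 175),   # eBay price
    "GTX 770":   (224, 330),
    "GTX 780":   (288, 500),
    "GTX Titan": (288, 950),
}

# Rank cards by bandwidth per dollar, the rough value metric used above.
ranked = sorted(cards.items(),
                key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (bw, price) in ranked:
    print(f"{name:9s} {bw / price:.2f} GB/s per dollar")
```

On these numbers the GTX 580 comes out on top, which matches the recommendation above; plugging in current local prices changes the ranking accordingly.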
[Image source: NVIDIA CUDA C Programming Guide]
How much slower are mid-level GPUs? For example, I have a Mac with a GeForce 750M; is it suitable for training DNN models?
There is a GT 750M version with DDR3 memory and one with GDDR5 memory; the GDDR5 version will be about three times as fast as the DDR3 version. With a GDDR5 model you will probably run three to four times slower than a typical desktop GPU, but you should still see a good 5-8x speedup over a desktop CPU. So a GDDR5 750M will be sufficient for running most deep learning models. If you have the DDR3 version, it might be too slow for deep learning (smaller models might take a day; larger models a week or so).
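The roughly-three-times figure falls out of the memory specs. Memory bandwidth is the effective memory clock times the bus width; the clock and bus-width numbers below are the commonly listed GT 750M specs and may differ for a particular laptop model:

```python
def bandwidth_gbs(effective_clock_mts, bus_width_bits):
    """Memory bandwidth in GB/s from effective clock (MT/s) and bus width."""
    return effective_clock_mts * 1e6 * (bus_width_bits / 8) / 1e9

ddr3  = bandwidth_gbs(1800, 128)   # DDR3 GT 750M:  ~28.8 GB/s
gddr5 = bandwidth_gbs(5000, 128)   # GDDR5 GT 750M: ~80.0 GB/s
print(f"GDDR5 is {gddr5 / ddr3:.1f}x faster in bandwidth")
```
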
Is a GPU any good for processing non-mathematical or non-floating-point data? How about generating hashes and keypairs?
Sometimes it is good, but often it isn’t – it depends on the use case. One application of GPUs for hash generation is bitcoin mining. However, the main measure of success in bitcoin mining (and cryptocurrency mining in general) is how many hashes you generate per watt of energy; GPUs are mid-field here, beating CPUs but being beaten by FPGAs and other low-energy hardware.
In the case of key-value pair generation, e.g. in MapReduce, you often do little computation but many IO operations, so GPUs cannot be utilized efficiently. For many applications GPUs are significantly faster in one case but not in another similar case, e.g. for some but not all regular expressions, and this is the main reason why GPUs are not used in such cases.
Hi, nice writeup! Are you using single or double precision floats? You said divide by 4 for the byte size, which sounds like 32 bit floats, but then you point out that the Fermi cards are better than Kepler, which is more true when talking about double precision than single, as the Fermi cards have FP64 at 1/8th of FP32 while Kepler is 1/24th. Trying to decide myself whether to go with the cheaper Geforce cards or to spring for a Titan.
Thanks for your comment, James. Yes, deep learning is generally done with single-precision computation, as the gains in precision do not improve the results greatly.
It depends on what types of neural networks you want to train and how large they are. But I think a good decision would be to go for a 3GB GTX 580 from eBay and then upgrade to a GTX 1000 series card next year. The GTX 1000 series cards will probably be quite good for deep learning, so waiting for them might be a wise choice.
Thank you for the great post. Could you say something about pairing a new card with an older CPU?
For example, I have a 4-core Intel Q6600 from 2007 with 8GB of RAM (without the possibility to upgrade). Could this be a bottleneck if I choose to buy a new GPU for CUDA and ML?
I’m also not sure which one is the better choice: a GTX 780 with 2GB of RAM vs a GTX 970 with 4GB of RAM. The 780 has more cores, but they are a bit slower…
http://www.game-debate.com/gpu/index.php?gid=2438&gid2=880&compare=geforce-gtx-970-4gb-vs-geforce-gtx-780
A nice list of characteristics, but still, I’m not sure which would be the better choice. I would use the GPU for all kinds of problems, perhaps some with smaller networks, but I wouldn’t be shy about trying something bigger once I feel comfortable enough.
What would you recommend?
Hi enedene, thanks for your question!
Your CPU should be sufficient and should slow you down only slightly (1-10%).
My post is now a bit outdated as the new Maxwell GPUs have been released. The Maxwell architecture is much better than the Kepler architecture and so the GTX 970 is faster than the GTX 780 even though it has lower bandwidth. So I would recommend getting a GTX 970 over a GTX 780 (of course, a GTX 980 would be better still, but a GTX 970 will be fine for most things, even for larger nets).
For low budgets I would still recommend a GTX 580 from eBay.
I will update my post next week to reflect the new information.
Thank you for the quick reply. I will most probably get a GTX 970. Looking forward to your updated post, and to competing against you on Kaggle. :)