How GPUs accelerate deep learning


The embarrassingly parallel nature of neural networks

Neural networks and deep learning are not recent methods. In fact, they are quite old. Perceptrons, the first neural networks, were created in 1958 by Frank Rosenblatt. Even the invention of the ubiquitous building blocks of deep learning architectures happened mostly near the end of the 20th century. For example, convolutional networks were introduced in 1989 in the landmark paper Backpropagation Applied to Handwritten Zip Code Recognition by Yann LeCun et al.

Why did the deep learning revolution have to wait decades?

One major reason was the computational cost. Even the smallest architectures can have dozens of layers and millions of parameters, so repeatedly calculating gradients during training is computationally expensive. On large enough datasets, training used to take days or even weeks. Nowadays, you can train a state-of-the-art model on your notebook in a few hours.

There were three major advances which brought deep learning from a research tool to a method present in almost all areas of our life: backpropagation, stochastic gradient descent, and GPU computing. In this post, we are going to dive into the latter and see that neural networks are actually embarrassingly parallel algorithms, which can be leveraged to reduce computational costs by orders of magnitude.

A big pile of linear algebra

Deep neural networks may seem complicated at first glance. However, if we zoom into them, we can see that their components are pretty simple in most cases. As the always brilliant xkcd puts it, a network is (mostly) a pile of linear algebra.

[Comic: a neural network as a big pile of linear algebra. Source: xkcd]

During training, the most commonly used functions are basic linear algebra operations such as matrix multiplication and addition. The situation is simple: if you call a function a bazillion times, shaving even the tiniest amount of time off each call compounds into serious savings.

Using GPUs does not just provide a small improvement here; it supercharges the entire process. To see how this is done, let's consider activations as an example.

Suppose that φ is an activation function such as ReLU or Sigmoid. Applied to the output of the previous layer

x = (x₁, x₂, …, xₙ),

the result is

φ(x) = (φ(x₁), φ(x₂), …, φ(xₙ)).

(The same goes for multidimensional input such as images.)

This requires looping over the vector and calculating the value for each element. There are two ways to make this computation faster. First, we can calculate each φ(xᵢ) faster. Second, we can calculate the values φ(x₁), φ(x₂), …, φ(xₙ) simultaneously, in parallel. In fact, this task is embarrassingly parallel, which means that the computation can be parallelized without any significant additional effort.
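To make this concrete, here is a minimal sketch in Python with NumPy (the function names are just for illustration): the element-wise ReLU computed with an explicit loop versus a single vectorized call that applies the same operation to every element at once. Vectorization is not GPU execution by itself, but it is exactly the kind of pattern that parallel hardware can exploit.

```python
import numpy as np

def relu_loop(x):
    # Sequential version: visit every element one by one.
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = x[i] if x[i] > 0 else 0.0
    return out

def relu_vectorized(x):
    # One call applies the same operation to every element,
    # letting the underlying implementation do the work in parallel.
    return np.maximum(x, 0.0)

x = np.random.randn(1_000_000).astype(np.float32)
assert np.allclose(relu_loop(x), relu_vectorized(x))
```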

Over the years, making individual operations faster has become much more difficult. Processor clock speeds used to grow rapidly year after year, but this growth has plateaued. Modern processor design has reached a point where packing ever more transistors into a chip runs into quantum-mechanical barriers.

However, calculating the values in parallel does not require faster processors, just more of them. This is how GPUs work, as we are going to see.

The principles of GPU computing

Graphics Processing Units, or GPUs for short, were developed to create and process images. Since the value of every pixel can be calculated independently of the others, it is better to have a lot of weaker processors than a single very strong one doing the calculations sequentially.

This is the same situation we have with deep learning models: most operations can easily be decomposed into parts that can be completed independently.

[Figure: the nVidia Fermi architecture. There have been many improvements since, but it illustrates the point well. Source: nVidia Fermi architecture whitepaper]

To give you an analogy, consider a restaurant which has to produce French fries on a massive scale. To do this, workers must peel, slice, and fry the potatoes. Hiring people to peel the potatoes costs much more than purchasing kitchen robots capable of performing this task. Even if the robots are slower, you can buy many more of them for the same budget, so overall the process will be faster.

Modes of parallelism

When talking about parallel programming, computing architectures can be classified into four different classes. This classification was introduced by Michael J. Flynn in 1966 and has been in use ever since.

  1. Single Instruction, Single Data (SISD)
  2. Single Instruction, Multiple Data (SIMD)
  3. Multiple Instructions, Single Data (MISD)
  4. Multiple Instructions, Multiple Data (MIMD)

A multi-core processor is MIMD, while a GPU is SIMD. Deep learning is a problem for which SIMD is very well suited: when you calculate the activations, the exact same operation needs to be performed, just with different data for each call.

Latency vs throughput

To get a more detailed picture of why GPUs are better than CPUs for this kind of work, we need to look at latency and throughput. Latency is the time required to complete a single task, while throughput is the number of tasks completed per unit time.

Simply put, a GPU can provide much better throughput, at the cost of latency. For embarrassingly parallel tasks such as matrix computations, this can offer an order of magnitude improvement in performance. However, it is not well suited for complex tasks, such as running an operating system.

CPUs, on the other hand, are optimized for latency, not throughput, and they can do much more than floating-point calculations.
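As a rough illustration of the throughput difference, here is a hedged sketch in PyTorch that times a large matrix multiplication on the CPU and, if available, on the GPU (it assumes a CUDA-capable card and a CUDA-enabled PyTorch install; the actual speedup depends heavily on the hardware and the matrix size).

```python
import time
import torch

def time_matmul(device, n=4096, repeats=10):
    # Multiply two random n x n matrices several times and report the average time.
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)  # warm-up, so one-time setup costs are excluded
    if device == "cuda":
        torch.cuda.synchronize()  # GPU calls are asynchronous; wait before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per multiplication")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per multiplication")
```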

General purpose GPU programming

In practice, general purpose GPU programming was not available for a long time. GPUs were restricted to graphics, and if you wanted to leverage their processing power, you had to learn graphics APIs such as OpenGL. This was not very practical, and the barrier to entry was high.

This was the case until 2007, when nVidia launched the CUDA framework, an extension of C which provides an API for GPU computing. This significantly lowered the barrier to entry. Fast forward a few years, and modern deep learning frameworks use GPUs without us having to think about it.
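CUDA kernels themselves are written in an extended C/C++, but the programming model can be sketched from Python with the Numba library (an illustrative assumption: this requires Numba with CUDA support and an nVidia GPU). The hypothetical kernel below is the element-wise activation from earlier, with each GPU thread handling one element of the array.

```python
import numpy as np
from numba import cuda

@cuda.jit
def relu_kernel(x, out):
    # Each thread computes its own global index and processes a single element.
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] if x[i] > 0.0 else 0.0

x = np.random.randn(1_000_000).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
relu_kernel[blocks, threads_per_block](x, out)  # launch one thread per element
```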

GPU computing for deep learning

So far we have talked about how GPU computing can be used for deep learning, but we haven't seen the effects. The following table shows a benchmark from 2017. Although it is a few years old, it still demonstrates the order-of-magnitude improvement in speed.

[Table: CPU vs GPU benchmarks for various deep learning frameworks. The benchmark is from 2017, so it reflects the state of the art of that time; however, the point still stands: GPUs outperform CPUs for deep learning. Source: Benchmarking State-of-the-Art Deep Learning Software Tools]

How modern deep learning frameworks use GPUs

Programming directly in CUDA and writing kernels yourself is not the easiest thing to do. Thankfully, modern deep learning frameworks such as TensorFlow and PyTorch don't require you to do that. Behind the scenes, the computationally intensive parts are written in CUDA, using its deep learning library cuDNN. These are called from Python, so you don't need to use them directly at all. Python is really strong in this respect: it can be combined with C easily, which gives you both power and ease of use.
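For example, in PyTorch, moving a model and its data to the GPU is essentially a one-line change; the CUDA-backed kernels are invoked behind the scenes. This is a minimal sketch, assuming a CUDA-enabled PyTorch install; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
batch = torch.randn(64, 784, device=device)  # inputs must live on the same device

logits = model(batch)  # the heavy lifting runs on whichever device was selected above
```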

This is similar to how NumPy works behind the scenes: it is blazing fast because its functions are written directly in C.

Do you need to build a deep learning rig?

If you want to train deep learning models on your own, you have several choices. First, you can build a GPU machine yourself; however, this can be a significant investment. Thankfully, you don't need to: cloud providers such as Amazon and Google offer remote GPU instances to work on. If you want to access resources for free, check out Google Colab, which offers free access to GPU instances.
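Whichever route you take, it is worth checking that your code actually sees a GPU before you start training. A minimal check with PyTorch (assuming it is installed in the environment) might look like this:

```python
import torch

if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU found; training will run on the CPU.")
```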

Conclusion

Deep learning is computationally very intensive. For decades, training neural networks was limited by hardware. Even relatively small models had to be trained for days, and training large architectures on huge datasets was impossible.

However, with the appearance of general purpose GPU programming, deep learning exploded. GPUs excel at parallel computation, and since these algorithms can be parallelized very efficiently, training and inference can be accelerated by several orders of magnitude.

This has opened the way for rapid growth. Now, even relatively cheap, commercially available computers can train state-of-the-art models. Combined with amazing open source tools such as TensorFlow and PyTorch, people are building awesome things every day. This is truly a great time to be in the field.

