How GPUs accelerate deep learning

栏目: IT技术 · 发布时间: 4年前

内容简介:Why did the deep learning revolution had to wait decades?One major reason was the computational cost. Even the smallest architectures can have dozens of layers and millions of parameters, so repeatedly calculating gradients during is computationally expens

The embarrassingly parallel nature of neural networks

N eural networks and deep learning are not recent methods. In fact, they are quite old. Perceptrons, the first neural networks, were created in 1958 by Frank Rosenblatt. Even the invention of the ubiquitous building blocks of deep learning architectures happened mostly near the end of the 20th century. For example, convolutional networks were introduced in 1989 in the landmark paper Backpropagation Applied to Handwritten Zip Code Recognition by Yann LeCun et al.

Why did the deep learning revolution had to wait decades?

One major reason was the computational cost. Even the smallest architectures can have dozens of layers and millions of parameters, so repeatedly calculating gradients during is computationally expensive. On large enough datasets, training used to take days or even weeks. Nowadays, you can train a state of the art model in your notebook under a few hours.

There were three major advances which brought deep learning from a research tool to a method present in almost all areas of our life. These are backpropagation , stochastic gradient descent and GPU computing . In this post, we are going to dive into the latter and see that neural networks are actually embarrassingly parallel algorithms, which can be leveraged to improve computational costs by orders of magnitude.

A big pile of linear algebra

Deep neural networks may seem complicated for the first glance. However, if we zoom into them, we can see that its components are pretty simple in most cases. As the always brilliant xkcd puts it, a network is (mostly) a pile of linear algebra.

How GPUs accelerate deep learning

Source: xkcd

During training, the most commonly used functions are the basic linear algebra operations such as matrix multiplication and addition. The situation is simple: if you call a function a bazillion times, shaving off just the tiniest amount of the time from the function call can compound to a serious amount.

Using GPU-s not only provide a small improvement here, they supercharge the entire process. To see how it is done, let’s consider activations for instance.

Suppose that φ is an activation function such as ReLU or Sigmoid. Applied to the output of the previous layer

How GPUs accelerate deep learning

the result is

How GPUs accelerate deep learning

(The same goes for multidimensional input such as images.)

This requires to loop over the vector and calculate the value for each element. There are two ways to make this computation faster. First, we can calculate each φ(xᵢ) faster. Second, we can calculate the values φ(x), φ(x), …, φ(x) simultaneously, in parallel. In fact, this is embarrassingly parallel , which means that the computation can be parallelized without any significant additional effort.

Over the years, doing things faster became much more difficult. A processor’s clock speed used to double almost every year, but this has plateaued recently. Modern processor design has reached a point where packing more transistors into the units has quantum-mechanical barriers.

However, calculating the values in parallel does not require faster processors, just more of them. This is how GPUs work, as we are going to see.

The principles of GPU computing

Graphics Processing Units, or GPUs in short were developed to create and process images. Since the value of every pixel can be calculated independently of others, it is better to have a lot of weaker processors than a single very strong one doing the calculations sequentially.

This is the same situation we have for deep learning models. Most operations can be easily decomposed to parts which can be completed independently.

How GPUs accelerate deep learning

nVidia Fermi architecture. There has been many improvements to this, but it illustrates the point well. Source: nVidia Fermi architecture whitepaper

To give you an analogy, let’s consider a restaurant, which has to produce French fries on a massive scale. To do this, workers must peel, slice and fry the potato. Hiring people to peel the potatoes costs much more than purchasing many more kitchen robots capable to perform this task. Even if the robots are slower, you can buy much more from the budget, so overall the process will be faster.

Modes of parallelism

When talking about parallel programming, one can classify the computing architectures into four different classes. This was introduced by Michael J. Flynn in 1966 and it is in use ever since.

  1. S ingle I nstruction, S ingle D ata (SISD)
  2. S ingle I nstruction, M ultiple D ata (SIMD)
  3. M ultiple I nstructions, S ingle D ata (MISD)
  4. M ultiple I nstructions, M ultiple D ata (MIMD)

A multi-core processor is MIMD, while GPUs are SIMD. Deep learning is a problem for which SIMD is very well suited. When you calculate the activations, the same exact operation needs to be performed, with different data for each call.

Latency vs throughput

To give a more detailed picture on what GPU better than CPU, we need to take a look into latency and throughput . Latency is the time required to complete a single task, while throughput is the number of tasks completed per unit time.

Simply put, a GPU can provide much better throughput, at the cost of latency. For embarrassingly parallel tasks such as matrix computations, this can offer an order of magnitude improvement in performance. However, it is not well suited for complex tasks, such as running an operating system.

CPU, on the other hand, is optimized for latency, not throughput. They can do much more than floating point calculations.

General purpose GPU programming

In practice, general purpose GPU programming was not available for a long time. GPU-s were restricted to do graphics, and if you wanted to leverage their processing power, you needed to learn graphics programming languages such as OpenGL. This was not very practical and the barrier of entry was high.

This was the case until 2007, when nVidia launched the CUDA framework, an extension of C, which provides an API for GPU computing. This significantly flattened the learning curve for users. Fast forward a few years: modern deep learning frameworks use GPUs without us explicitly knowing about it.

GPU computing for deep learning

So, we have talked about how GPU computing can be used for deep learning, but we haven’t seen the effects. The following table shows a benchmark, which was made in 2017. Although it was made three years ago, it still demonstrates the order of magnitude improvement in speed.

How GPUs accelerate deep learning

CPU vs GPU benchmarks for various deep learning frameworks. (The benchmark is from 2017, so it considers the state of the art back from that time. However, the point still stands: GPU outperforms CPU for deep learning.) Source: Benchmarking State-of-the-Art Deep Learning Software Tools

How modern deep learning frameworks use GPUs

Programming directly in CUDA and writing kernels by yourself is not the easiest thing to do. Thankfully, modern deep learning frameworks such as TensorFlow and PyTorch doesn’t require you to do that. Behind the scenes, the computationally intensive parts are written in CUDA using its deep learning library cuDNN . These are called from Python, so you don’t need to use them directly at all. Python is really strong in this aspect: it can be combined with C easily, which gives you both the power and the ease of use.

This is similar to how NumPy works behind the scenes: it is blazing fast because its functions are written directly in C.

Do you need to build a deep learning rig?

If you want to train deep learning models on your own, you have several choices. First, you can build a GPU machine for yourself, however, this can be a significant investment. Thankfully, you don’t need to do that: cloud providers such as Amazon and Google offer remote GPU instances to work on. If you want to access resources for free, check out Google Colab , which offers free access to GPU instances.

Conclusion

Deep learning is computationally very intensive. For decades, training neural networks was limited by hardware. Even relatively smaller models had to be trained for days, and training large architectures on huge datasets was impossible.

However, with the appearance of general computing GPU programming, deep learning exploded. GPUs excel in parallel programming, and since these algorithms can be parallelized very efficiently, it can accelerate training and inference by several orders of magnitude.

This has opened the way for rapid growth. Now, even relatively cheap commercially available computers can train state of the art models. Combined with the amazing open source tools such as TensorFlow and PyTorch, people are building awesome things every day. This is truly a great time to be in the field.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

用户中心设计

用户中心设计

(美国)弗登伯格等编 / 高等教育出版社 / 2003-8 / 20.00元

本书以用户对最终产品或系统的所见及所感为出发点考虑设计方法,所涉及的产品从数据库软件到语音识别软件,在众多项目(医疗保健、金融证券、航空事业、保险业、汽车制造业及零售业等)中得到验证。内容包括:能带来突破性增益的针对UCD的完整的周期化方法;现有产品评测、机构评定以使其适用UCD方法;提高用户感知舒适度;在外延型/内适型应用环境下的软件设计、硬件设计、网站建设和服务中应用UCD;当前UCD优化及未......一起来看看 《用户中心设计》 这本书的介绍吧!

SHA 加密
SHA 加密

SHA 加密工具

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试

HEX CMYK 转换工具
HEX CMYK 转换工具

HEX CMYK 互转工具