A visual and mathematical explanation of the 2D convolution layer and its arguments
May 2 · 9 min read
Introduction
Deep Learning libraries and platforms such as Tensorflow, Keras, Pytorch, Caffe or Theano help us in our daily lives, so that every day new applications make us think “Wow!”. We all have our favorite framework, but what they all have in common is that they make things easy for us with easy-to-use functions that can be configured as needed. But we still need to understand which arguments are available to take full advantage of the power these frameworks give us.
In this post, I will try to list all these arguments. This post is for you if you want to see their impact on the computation time, the number of trainable parameters and the size of the convolved output channels.
All GIFs are made with Python. You will be able to test each of these arguments and visualize their impact yourself with the scripts pushed to my Github (or make your own GIFs).
The parts of this post are divided according to the following arguments, which can be found in the Pytorch documentation of the Conv2d module:
- in_channels (int) — Number of channels in the input image
- out_channels (int) — Number of channels produced by the convolution
- kernel_size (int or tuple) — Size of the convolving kernel
- stride (int or tuple, optional) — Stride of the convolution. Default: 1
- padding (int or tuple, optional) — Zero-padding added to both sides of the input. Default: 0
- dilation (int or tuple, optional) — Spacing between kernel elements. Default: 1
- groups (int, optional) — Number of blocked connections from input channels to output channels. Default: 1
- bias (bool, optional) — If True, adds a learnable bias to the output. Default: True
Finally, we will have all the keys to calculate the size of the output channels according to the arguments and the size of the input channels.
What is a Kernel?
Let me introduce what a kernel (or convolution matrix) is. A kernel describes a filter that we are going to pass over an input image. To put it simply, the kernel moves over the whole image, from left to right and from top to bottom, applying a convolution product at each position. The output of this operation is called a filtered image.
To take a very basic example, let’s imagine a 3 by 3 convolution kernel filtering a 9 by 9 image. The kernel moves all over the image to capture every square of the same size (3 by 3) it contains. At each position, the convolution product is an element-wise (or point-wise) multiplication between the kernel and the current square; the sum of this result is the corresponding pixel of the output (or filtered) image.
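The operation above can be sketched in a few lines of plain Python. The image content and kernel values below are made up for illustration; only the mechanics (slide the window, multiply element-wise, sum) matter.

```python
def convolve2d(image, kernel):
    """'Valid' (no padding, stride 1) 2D convolution of `image` by `kernel`."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # element-wise product of the current window and the kernel, then sum
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

image = [[1] * 9 for _ in range(9)]           # a flat 9x9 image
kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # a classic Laplacian edge filter

filtered = convolve2d(image, kernel)
print(len(filtered), len(filtered[0]))  # 7 7: a 9x9 image filtered by 3x3 gives 7x7
```

Note that the filtered image is smaller than the input: the 3 by 3 window only has 7 by 7 valid positions inside a 9 by 9 image. This is exactly the size reduction that padding (covered below) compensates for.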
If you are not already familiar with filters and convolution matrices, then I strongly advise you to take a little more time to understand the convolution kernels. They are the core of the 2D convolution layer .
Trainable Parameters and Bias
The trainable parameters, also simply called “parameters”, are all the parameters that are updated when the network is trained. In a Conv2d, the trainable elements are the values that compose the kernels. So for our 3 by 3 convolution kernel, we have 3*3=9 trainable parameters.
To be more complete, we can choose to include a bias or not. The role of the bias is to be added to the sum of the convolution product. It is also a trainable parameter, which raises the number of trainable parameters for our 3 by 3 kernel to 10.
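These two counts can be checked directly in Pytorch by summing the elements of the layer's parameter tensors:

```python
import torch.nn as nn

# a single 3x3 kernel with a bias: 3*3 = 9 weights + 1 bias
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=True)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 10

# the same layer without a bias keeps only the 9 kernel weights
conv_no_bias = nn.Conv2d(1, 1, kernel_size=3, bias=False)
n_params_no_bias = sum(p.numel() for p in conv_no_bias.parameters())
print(n_params_no_bias)  # 9
```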
Number of Input and Output Channels
The benefit of using a layer is to be able to perform several similar operations at the same time. In other words, if we want to apply 4 different filters of the same size to an input channel, then we will have 4 output channels. These channels are the results of 4 distinct kernels.
In the previous section, we saw that the trainable parameters are what make up the convolution kernels. So the number of parameters increases linearly with the number of convolution kernels, hence linearly with the number of desired output channels. Note that the computation time also varies proportionally with the size of the input channel and with the number of kernels.
The same principle applies to the number of input channels. Let’s consider the situation of an RGB encoded image. This image has 3 channels: red, green and blue. We can decide to extract information with filters of the same size from each of these 3 channels to obtain 4 new channels. The operation is thus repeated 3 times over, for 4 output channels.
Each output channel is the sum of the filtered input channels. With 4 output channels and 3 input channels, each output channel is the sum of 3 filtered input channels; in other words, the convolution layer is composed of 4*3=12 convolution kernels.
As a reminder, the number of parameters and the computation time grow proportionally with the number of output channels. This is because each output channel is linked to its own kernels, distinct from those of the other channels. The same is true for the number of input channels: the computation time and the number of parameters grow proportionally with it.
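The 12-kernel count for the RGB example above is visible in the shape of the layer's weight tensor, which Pytorch stores as (out_channels, in_channels, kernel_height, kernel_width):

```python
import torch.nn as nn

# 3 input channels (RGB) and 4 output channels: the layer holds
# 4*3 = 12 distinct 3x3 kernels, one per (output, input) channel pair.
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, bias=False)
print(tuple(conv.weight.shape))  # (4, 3, 3, 3)
print(conv.weight.numel())       # 108 = 12 kernels * 9 weights each
```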
Kernel size
So far, all examples have been given with 3 by 3 kernels. In fact, the choice of the kernel size is entirely up to you: it is possible to create a convolution layer with a kernel size of 1*1 or 19*19.
But kernels do not have to be square: it is absolutely possible to have kernels with different heights and widths. This is often the case in signal image analysis. If we know that we want to scan the image of a signal, of a sound, then we may prefer a 5*1 kernel, for example.
Finally, you will have noticed that all these sizes are odd numbers. An even kernel size is just as acceptable, but in practice it is rarely used: an odd-sized kernel is usually chosen because it is symmetric around a central pixel.
Since all the (classical) trainable parameters of a convolution layer are in the kernels, the number of parameters grows linearly with the size of the kernels. The computation time also varies proportionally.
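A quick sketch of the non-square case: a 5*1 kernel has 5 trainable weights, and it shrinks the output only along the dimension it spans.

```python
import torch
import torch.nn as nn

# a non-square 5*1 kernel, as one might use to scan a signal image
conv = nn.Conv2d(1, 1, kernel_size=(5, 1), bias=False)
print(tuple(conv.weight.shape))  # (1, 1, 5, 1): 5 trainable weights

x = torch.zeros(1, 1, 9, 9)      # a dummy 9x9 single-channel image
y = conv(x)
print(tuple(y.shape))            # (1, 1, 5, 9): height shrinks by 4, width is unchanged
```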
Strides
The kernel, by default, moves from left to right and from top to bottom, pixel by pixel. But this movement, the stride, can be changed; it is often used to down-sample the output channel. For example, with strides of (1, 3), the filter is shifted 3 pixels at a time horizontally and 1 pixel at a time vertically. This produces output channels down-sampled by 3 horizontally.
The strides have no impact on the number of parameters, but the computation time, logically, decreases proportionally with the strides.
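On a dummy 9 by 9 input, the (1, 3) stride from the example above gives:

```python
import torch
import torch.nn as nn

# strides of (1, 3): the 3x3 kernel moves 1 pixel at a time vertically
# and 3 pixels at a time horizontally, down-sampling the width by 3
conv = nn.Conv2d(1, 1, kernel_size=3, stride=(1, 3))
x = torch.zeros(1, 1, 9, 9)
y = conv(x)
print(tuple(y.shape))  # (1, 1, 7, 3): 7 vertical positions, 3 horizontal ones
```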
Padding
The padding defines the number of pixels added to each side of the input channels before the convolution filtering. Usually, the padding pixels are set to zero: the input channel is extended.
This is very useful when you want the size of the output channels to be equal to the size of the input channels. To put it simply, with a 3*3 kernel the output channel size decreases by one pixel on each side; to compensate, we can use a padding of 1.
The padding, therefore, has no impact on the number of parameters, but it generates an additional computation time proportional to the size of the padding. Generally speaking, though, the padding is small enough compared to the size of the input channel that its impact on the computation time can be neglected.
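The "same size" effect is easy to check on a dummy input: with a 3*3 kernel and a padding of 1, the output channels keep the input size.

```python
import torch
import torch.nn as nn

# a padding of 1 with a 3x3 kernel keeps the output the same size as the input
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
x = torch.zeros(1, 1, 9, 9)
y = conv(x)
print(tuple(y.shape))  # (1, 1, 9, 9): same spatial size as the input

# without padding, the output loses one pixel on each side
no_pad = nn.Conv2d(1, 1, kernel_size=3, padding=0)
print(tuple(no_pad(x).shape))  # (1, 1, 7, 7)
```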
Dilation
The dilation is, in a way, the spread of the kernel. Equal to 1 by default, it corresponds to the offset between two neighboring pixels of the kernel on the input channel during the convolution.
I exaggerated a bit in my GIF, but if we take the example of a dilation of (4, 2), then the receptive field of a 3 by 3 kernel on the input channel is widened to span 4*(3-1)+1=9 pixels vertically and 2*(3-1)+1=5 pixels horizontally, while it still has only 3*3=9 weights.
Just like padding, dilation has no impact on the number of parameters and very limited impact on the calculation time.
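The widened receptive field shows up in the output size: on a 9 by 9 input, the (4, 2)-dilated 3 by 3 kernel behaves, size-wise, like a 9 by 5 kernel.

```python
import torch
import torch.nn as nn

# with a dilation of (4, 2), the 3x3 kernel's receptive field spans
# 4*(3-1)+1 = 9 pixels vertically and 2*(3-1)+1 = 5 pixels horizontally
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=(4, 2))
print(conv.weight.numel())  # still only 9 weights (plus 1 bias)

x = torch.zeros(1, 1, 9, 9)
y = conv(x)
print(tuple(y.shape))  # (1, 1, 1, 5): only 1 vertical and 5 horizontal positions fit
```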
Groups
Groups can be very useful in specific cases, for example when we have several concatenated data sources that do not need to be treated dependently on each other. The input channels can then be grouped and processed independently, and the resulting output channels are concatenated at the end.
With 2 input channels and 4 output channels in 2 groups, this amounts to dividing the input channels into two groups (so 1 input channel in each group) and passing each through a convolution layer with half as many output channels. The output channels are then concatenated.
It is important to note two things. Firstly, the number of groups must divide both the number of input channels and the number of output channels (it must be a common divisor). Secondly, each group has its own set of kernels: a kernel in one group only sees the input channels of that group.
The number of parameters is therefore divided by the number of groups. Concerning the computation time, the Pytorch algorithm is optimized for groups and should therefore reduce it. However, the computation time needed to form the groups and to concatenate the output channels must also be taken into account.
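The division of the parameter count is visible in the weight tensor, whose second dimension becomes in_channels / groups:

```python
import torch.nn as nn

# 2 input channels, 4 output channels: with groups=2, each group convolves
# 1 input channel into 2 output channels, so each kernel sees fewer inputs
ungrouped = nn.Conv2d(2, 4, kernel_size=3, bias=False)
grouped = nn.Conv2d(2, 4, kernel_size=3, groups=2, bias=False)

print(tuple(grouped.weight.shape))  # (4, 1, 3, 3): in_channels/groups = 1
print(ungrouped.weight.numel())     # 72 = 4 * 2 * 3 * 3
print(grouped.weight.numel())       # 36 = 72 / 2 groups
```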
Output Channel Size
With the knowledge of all the arguments, the size of the output channels can be calculated from the size of the input channels.
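As a recap, here is the formula from the Pytorch Conv2d documentation for one spatial dimension, sketched as a small helper function; the checks below reproduce the sizes seen in the examples above.

```python
import math

def conv_output_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output size along one spatial dimension, per the Pytorch Conv2d docs:
    floor((size_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
    """
    return math.floor(
        (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1
    )

print(conv_output_size(9, 3))              # 7: plain 3x3 kernel on a 9-pixel side
print(conv_output_size(9, 3, padding=1))   # 9: padding of 1 preserves the size
print(conv_output_size(9, 3, stride=3))    # 3: stride of 3 down-samples by 3
print(conv_output_size(9, 3, dilation=4))  # 1: the dilated kernel barely fits
```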
Sources
Deep Learning Tutorial, Y. LeCun
torch.nn documentation, Pytorch
Convolutional Neural Networks, cs231n
Convolutional Layers, Keras
All the images are home made
All computation time tests have been run with Pytorch, on my GPU (GeForce GTX 960M) and are available on this GitHub repository if you want to run them yourself or perform alternative tests.