Deep Learning for Computer Vision

Category: IT Technology · Published: 3 years ago


Understanding Convolutional, Pooling and Fully Connected layers of CNN


May 26 · 12 min read

Deep learning in computer vision has made rapid progress over a short period. Some of the applications where deep learning is used in computer vision include face recognition systems, self-driving cars, etc.

This article introduces convolutional neural networks, also known as convnets, a type of deep-learning model used almost universally in computer vision applications. We’ll take a deep dive into their components (convolutional layers, pooling layers, and fully connected layers) and how they can be applied to solve various problems.

So let’s get started.

The objective of CNNs:

Cat

Above is an image of a cat. As kids, we are told that this animal is a cat. As we grow up and see more images of cats, our brain registers their various features, like the shape of the eyes, ears, facial structure, whiskers, etc. The next time we see an image of an animal possessing those features, we’re able to predict that it is a cat, because that is something we’ve learned with experience.

Now we need to emulate the same behavior in computers. This class of problem is known as image classification in computer vision: we try to identify the object present in the image.

Given an image, a computer should first be able to extract its features and then, based on those features, predict what the object is. How can it achieve that? Well, the short answer is by using CNNs. How they do it is something we’ll deep dive into.

Introduction to CNNs:

So the objective of a CNN is to perform two tasks: the first is feature extraction, and the second is aggregating all the extracted features and making a prediction based on them.

Before we deep dive into the individual components, let’s see what a CNN looks like.


Sample CNN architecture depicting Conv, Pooling and Fully Connected layers. Source: https://www.learnopencv.com/wp-content/uploads/2017/11/cnn-schema1.jpg

From the above image, we can see three types of layers: the Convolutional Layer, the Pooling Layer, and the Fully Connected Layer. (We’ll discuss these in the following sections.)

It’s totally fine if the above image does not make complete sense yet. The reason for showing it is to have a mental image of what a CNN looks like, so that it becomes easy to connect the dots once we’ve understood its various layers.

Convolutional Operation:

Convolution is one of the fundamental building blocks of CNNs. The prime objective of the convolution operation is to extract features like edges, curves, corners, gradient orientation, etc., from the input image. We’ll understand the convolution operation with an edge detection example.

Suppose we are given an image and want to extract all the horizontal and vertical edges in it. The image below depicts this.


Example of Convolutional Operation

Consider a 6x6 grayscale image. To detect edges in that image, we construct a 3x3 matrix; in CNN terminology it’s called a filter or a kernel. Using these two matrices, we’ll perform the convolution operation. The resulting matrix, i.e. the output of the convolution operation, will be of size 4x4. The figure below depicts this.


Example of convolutional operation

Now, the elements of the resulting 4x4 matrix are computed as follows:

To compute the top-left element, we take the 3x3 filter and place it on top of the top-left 3x3 region of the original input image. Next, we take the element-wise product and add the nine results together to get the desired value.


Next, to figure out the second element, we take the filter, i.e. the yellow square, shift it one step to the right, do the same element-wise product, and add the results together. Similarly, we can fill in all the elements of that row.


Sliding filter one step to the right to get desired values

Now, to get the elements in the next row, we shift the filter down by one row and repeat the same element-wise product and addition. We can fill in the rest of the elements likewise. The final result is shown below.


Convolutional Operation

A couple of points here. A 6x6 matrix convolved with a 3x3 matrix gives us a 4x4 matrix. These are all essentially matrices, but it is convenient to interpret the matrix on the left as the input image, the one in the middle as a filter, and the one on the right as the output feature map.
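The sliding-window procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the image values and the vertical edge filter are hypothetical, hand-picked numbers (in a real CNN the filter values are learned), and the function name `conv2d_valid` is an assumption of this example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over a square `image` with stride 1 and no padding."""
    n, f = len(image), len(kernel)
    out_size = n - f + 1
    output = [[0] * out_size for _ in range(out_size)]
    for row in range(out_size):
        for col in range(out_size):
            total = 0
            # Element-wise product of the filter and the region under it,
            # followed by summing the products.
            for i in range(f):
                for j in range(f):
                    total += image[row + i][col + j] * kernel[i][j]
            output[row][col] = total
    return output


# Hypothetical 6x6 grayscale image: bright left half, dark right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Hand-written 3x3 vertical edge filter (illustrative values only).
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0, 30, 30, 0]
```

The non-zero columns in the 4x4 output mark where the bright and dark halves of the image meet, which is the vertical edge.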

The output feature dimension is calculated as follows:

n x n input image
f x f filter
output dimension = (n - f + 1)

Above example:
6 x 6 input image
3 x 3 filter
(6 - 3 + 1) x (6 - 3 + 1)
output dimensions = 4 x 4

NOTE: The values of our filters are often called weights. The values of the weights are learned during training: they are initialized with some random values and are adjusted with every training step.

Padding

Every time we perform a convolution operation, we lose some of the information present in the border pixels. Also, our image shrinks a little in size. There are times when we want to reduce the output size to save resources during training. However, there are also times when we want to keep the spatial dimensions of the output and input the same. To accomplish that, we can use padding.

Padding is nothing but adding an appropriate number of rows and columns on each side of the input feature map. With the right amount of padding, the feature maps produced by the filters have the same size as the original image.


Convolution with padding, p = 1 Source: https://github.com/vdumoulin/conv_arithmetic

In the figure (left), a blue square of dimensions 5x5 represents our input image, which has been padded by adding one row or column of zeros on each side. When convolved with a 3x3 filter, the output dimensions are the same as those of the input, i.e. 5x5, as depicted by the green square. Had we not used padding, the output dimensions would have been 3x3. Therefore, a padding of 1 kept the spatial dimensions of the input and output the same.
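As a rough illustration of zero padding, the sketch below surrounds a 5x5 input with a border of zeros. The helper name `zero_pad` and the input values are assumptions made for this example, not part of any framework.

```python
def zero_pad(image, p):
    """Return a copy of `image` with `p` rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]           # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)   # left and right borders
    padded.extend([0] * width for _ in range(p))       # bottom border
    return padded


# Hypothetical 5x5 input filled with ones.
image = [[1] * 5 for _ in range(5)]
padded = zero_pad(image, 1)

print(len(padded), len(padded[0]))  # 7 7
# A 3x3 filter over the 7x7 padded input yields 7 - 3 + 1 = 5 rows and columns,
# the same spatial size as the original 5x5 input.
```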

Valid and Same Padding

Let’s understand the terminology, based on whether padding is added or not.

Valid: No padding is added during convolution, so the resulting output shrinks. Example:

Input size:   6 x 6   (i x i)
Filter size:  3 x 3   (f x f)
Output size: (i – f + 1)
             (6 – 3 + 1) = 4
             = 4 x 4

Same: Padding is added such that the output size is the same as the input size. To calculate the output dimension, the above formula is modified to factor in the padding parameter. Example:

Input size:  6 x 6   (i x i)
Filter size: 3 x 3   (f x f)
Padding:     1       (p)
Output size: (i + 2p - f + 1)
             (6 + 2x1 - 3 + 1) = 6
             = 6 x 6

NOTE: By convention, the filter size f is usually odd; otherwise the padding would be asymmetric. Some of the most common filter sizes are 3x3, 5x5, and 1x1.
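The amount of padding needed for "same" convolution with stride 1 follows from setting the output size equal to the input size: i + 2p - f + 1 = i gives p = (f - 1) / 2, which is a whole number only when f is odd. A minimal sketch of that arithmetic (the function name `same_padding` is an assumption of this example):

```python
def same_padding(f):
    """Padding per side that keeps output size equal to input size (stride 1)."""
    if f % 2 == 0:
        # (f - 1) / 2 is not an integer, so the padding cannot be symmetric.
        raise ValueError("even filter sizes need asymmetric padding")
    return (f - 1) // 2


for f in (1, 3, 5):
    p = same_padding(f)
    # With i = 6: output size is i + 2p - f + 1, which should stay 6.
    print(f, p, 6 + 2 * p - f + 1)
```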

Striding

In our working example, to calculate the next element we shifted the filter by one position to the right. The number of positions (pixels) by which we move the filter over the input image at each step is the stride parameter.

The stride defines the step size of the filter when traversing the image. In most frameworks it defaults to 1.


Convolution with stride s = 2 Source: https://github.com/vdumoulin/conv_arithmetic

In the figure (left), a blue square of 5x5 represents our input image. When convolved with a 3x3 filter and a stride of 2, we get a downsampled output map with dimensions 2x2. Had we kept a stride of 1, the output dimensions would have been 3x3.

Therefore, we can increase the stride (step) length to save space or cut calculation time during training. However, we’ll be forgoing some information when doing so; hence it’s a trade-off between resource consumption (be it CPU or memory) and the information retained from the input.

NOTE: In general we usually keep the stride value to 1 and use other ways to downsample our feature map like using the pooling layer.
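To make the effect of the stride concrete, here is a small sketch that convolves a hypothetical 5x5 input with a 3x3 filter of ones, first with stride 1 and then with stride 2. The function name `conv2d_strided` and all values are illustrative assumptions.

```python
def conv2d_strided(image, kernel, stride=1):
    """Convolve a square image with a square kernel, no padding."""
    n, f = len(image), len(kernel)
    out_size = (n - f) // stride + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # The window's top-left corner moves `stride` pixels per step.
            top, left = row * stride, col * stride
            out_row.append(sum(image[top + i][left + j] * kernel[i][j]
                               for i in range(f) for j in range(f)))
        output.append(out_row)
    return output


# Hypothetical 5x5 input holding the values 0..24, and a filter of ones
# that simply sums each 3x3 window.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
box = [[1] * 3 for _ in range(3)]

full = conv2d_strided(image, box, stride=1)     # 3x3 output
strided = conv2d_strided(image, box, stride=2)  # 2x2 output
print(full)
print(strided)
```

With stride 2 the output contains every other element of the stride-1 result, which is the information that is given up in exchange for the smaller feature map.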

Summary of Convolution Operations:

The prime objective of the convolution operation is to extract meaningful information, such as edges and curves, from the input image. The animation below summarises how elements are calculated in a convolution operation.


Summary of Convolutional Operation

The equation below summarises the dimensions of the output feature map, where n is the input size, f the filter size, p the padding, and s the stride:

output dimension = floor((n + 2p - f) / s) + 1

Examples using the above equation:

6x6 input image, 3x3 filter      |   7x7 input image, 3x3 filter
padding p=1 & stride s=1         |   padding p=1 & stride s=2
                                 |
Output size:                     |   Output size:
(6 + 2*1 - 3)/1 + 1 = 6          |   (7 + 2*1 - 3)/2 + 1 = 4
6 x 6                            |   4 x 4
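The output-size arithmetic in these examples can be checked with a short helper. This is a sketch under the usual convention that the division by the stride is rounded down; the function name `conv_output_size` is an assumption of this example.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for input size n, filter size f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1


print(conv_output_size(6, 3, p=1, s=1))  # 6
print(conv_output_size(7, 3, p=1, s=2))  # 4
print(conv_output_size(6, 3))            # 4 (valid convolution)
print(conv_output_size(5, 3, s=2))       # 2 (the stride figure above)
```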

Convolution Over Volumes:

We have done convolution over matrices. Let’s now understand how to perform the convolution operation over volumes which will make it much more powerful. Previously we had a grayscale image of 6x6. Now let’s consider we have a 6x6 RGB image, so it’ll have dimensions 6x6x3 and instead of having a 3x3 filter, this time we’ll have 3x3x3 filters.

The output will still be a size of 4x4 (same as before), however, elements will be calculated by doing the element-wise product in each channel and adding them together as depicted below:


Convolution over volume: element-wise product in each channel then adding them together. Source: https://indoml.com/2018/03/07/student-notes-convolutional-neural-networks-cnn-introduction/

One point to note here is that the number of channels in the input and the filter has to be equal. This lets the filter hold different weights for each channel, for example an edge detector in every channel, to extract more meaningful information. So the idea is the same: get as much information into our output feature map as we can.

NOTE: The number of channels in the filter is usually not specified explicitly. It is assumed to be equal to the depth of input. For example, if we have an input of dimensions 26x26x64 and we are convolving using a filter of size 3x3, it is implicit that the number of channels in our filter will be 64, hence its actual dimensions are 3x3x64.
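A minimal sketch of convolution over a volume is shown below. It assumes the input is stored as `volume[row][col][channel]` and the filter as `kernel[i][j][channel]`; this layout, the function name `conv_volume`, and the all-ones values are choices made for this example only.

```python
def conv_volume(volume, kernel):
    """Convolve an n x n x c volume with an f x f x c filter (stride 1, no padding)."""
    n, f, channels = len(volume), len(kernel), len(volume[0][0])
    if len(kernel[0][0]) != channels:
        raise ValueError("filter depth must equal input depth")
    out_size = n - f + 1
    output = [[0] * out_size for _ in range(out_size)]
    for row in range(out_size):
        for col in range(out_size):
            # Multiply element-wise in every channel, then add everything
            # together into a single number.
            output[row][col] = sum(
                volume[row + i][col + j][c] * kernel[i][j][c]
                for i in range(f) for j in range(f) for c in range(channels)
            )
    return output


# Hypothetical 6x6x3 RGB input and 3x3x3 filter, all ones.
volume = [[[1, 1, 1] for _ in range(6)] for _ in range(6)]
kernel = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]

feature_map = conv_volume(volume, kernel)
print(len(feature_map), len(feature_map[0]))  # 4 4
print(feature_map[0][0])                      # 27 = 3 * 3 * 3 products summed
```

Note that the output is a flat 4x4 map with no channel dimension: the three channels are collapsed by the summation.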

Convolutions Using Multiple Filters:

Now that we know how to convolve over volumes, how about increasing the number of filters? Each filter extracts a different feature: one might extract vertical edges, another horizontal edges, another lines at 45 degrees, etc. In other words, we extend the convolution to use multiple filters.

Expanding our working example, we still have an input of 6x6x3. Now instead of one, we have two filters of dimensions 3x3 (depth of 3 is implicit). The convolution operation is performed in a similar fashion using each filter. Therefore, we’ll get two 4x4 output feature maps. Stacking one on top of the other, we can say the output dimensions are 4x4x2, as depicted below:


Convolution using 2 filters. The output is two 4x4 feature maps, one from each filter. Source: https://indoml.com/2018/03/07/student-notes-convolutional-neural-networks-cnn-introduction/

So we can detect multiple features now based on our number of filters. The true power of convolution is unleashed since now we can extract a lot of semantic information from our input.

Let’s consider an example to understand the number of values (weights) present in multiple filters:

Input volume dimensions: 26x26x64
Filter size: 3x3 (since input depth = 64, filter depth will also be 64)
Number of filters: 32

Weights in one filter = 3 x 3 x 64 = 576
Total filters = 32
Total weights = 32 x 576 = 18,432
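The same count can be reproduced with a small helper. The function name `conv_weight_count` and the `include_bias` option are assumptions of this sketch; the option adds one bias per filter, which the count above leaves out.

```python
def conv_weight_count(filter_size, input_depth, num_filters, include_bias=False):
    """Number of learnable values in a set of square convolution filters."""
    count = filter_size * filter_size * input_depth * num_filters
    if include_bias:
        count += num_filters  # one bias per filter
    return count


print(conv_weight_count(3, 64, 32))                     # 18432
print(conv_weight_count(3, 64, 32, include_bias=True))  # 18464
```

Notice that the count does not depend on the 26x26 spatial size of the input, because the same filter weights are reused at every position.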

That covers the convolution operation. Let’s now see what a typical convolutional layer in a CNN looks like:

Convolutional Layer:

In the above example, we got two 4x4 output maps. Now, for each of these output maps we add a bias. The bias is a real number, and we add it to all 16 elements of a feature map; each feature map has its own bias. The bias is like the intercept in a linear equation. Then we add non-linearity by applying an activation function.

A neural network without an activation function would simply be a linear model, much like linear regression, which has limited power and does not work well most of the time. Without an activation function, our neural network would not be able to learn and model complicated kinds of data such as images, videos, etc.

There are multiple candidates for the activation function; the most popular is ReLU.

When the ReLU function sees a positive number, it returns the same number; when it sees a negative number, it returns zero.

It helps mitigate the vanishing gradient problem. It has also been reported to converge about 6 times faster than the tanh activation function. (More about activation functions will be covered in another post.)

The figure below depicts the bias addition and the ReLU activation applied to our example:


Entire computation of convolutional layer

This entire computation, where we went from a 6x6x3 input to 4x4x2 output maps, is one convolutional layer in a CNN. The objective of the convolution operation is to extract features, such as edges, from the input image. Convnets need not be limited to a single convolutional layer.

Conventionally, the first convolutional layer is responsible for capturing low-level features such as edges, color, gradient orientation, etc. With added layers, the architecture adapts to high-level features as well, giving us a network that has a holistic understanding of the images in the dataset, similar to how we would.
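Putting the pieces together, one convolutional layer (convolution with several filters, a bias per filter, then ReLU) might be sketched as follows. The data layout `volume[row][col][channel]`, the function names, and the filter and bias values are all assumptions chosen to keep the arithmetic easy to follow; real layers learn these values during training.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0, x)


def conv_layer(volume, filters, biases):
    """Convolve with each filter, add that filter's bias, then apply ReLU.

    Stride 1, no padding. Returns output[row][col][filter_index].
    """
    n, f, depth = len(volume), len(filters[0]), len(volume[0][0])
    out_size = n - f + 1
    output = [[[0] * len(filters) for _ in range(out_size)]
              for _ in range(out_size)]
    for k, (kernel, bias) in enumerate(zip(filters, biases)):
        for row in range(out_size):
            for col in range(out_size):
                z = bias + sum(
                    volume[row + i][col + j][c] * kernel[i][j][c]
                    for i in range(f) for j in range(f) for c in range(depth)
                )
                output[row][col][k] = relu(z)
    return output


# Hypothetical 6x6x3 input of ones and two hand-picked 3x3x3 filters.
volume = [[[1, 1, 1] for _ in range(6)] for _ in range(6)]
all_plus = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]
all_minus = [[[-1, -1, -1] for _ in range(3)] for _ in range(3)]

activations = conv_layer(volume, [all_plus, all_minus], biases=[1, 1])
print(len(activations), len(activations[0]), len(activations[0][0]))  # 4 4 2
print(activations[0][0])  # [28, 0]
```

The first filter gives 27 + 1 = 28 at every position, while the second gives -27 + 1 = -26, which ReLU clips to 0.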

Pooling Layer:

Other than convolutional layers, convnets also use pooling layers. Pooling reduces the spatial size of the convolved feature maps, which decreases the computation required to process the data and speeds up training.

How the operation works is as follows:


In the figure (left), a 2x2 window is traversed over the input image with stride 2 and the maximum value is retained in each quadrant. This is called the Max Pooling operation.

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height. The intuition is that if a feature is present in any quadrant, pooling will try to retain it by keeping the highest activation value in that quadrant.

Pooling reduces the number of feature-map coefficients to process and induces the network to learn spatial hierarchies of features, i.e. it makes successive convolutional layers look at increasingly large windows in terms of the fraction of the original input they cover.


Two types of the Pooling operation

There are two types of pooling: Max Pooling and Average Pooling. Max Pooling returns the maximum value from the portion of the image covered by the filter. Average Pooling returns the average of all the values from the portion of the image covered by the filter.

NOTE: There are no parameters to learn during training in the pooling layer. Pooling with larger windows is usually too destructive, because it discards too much information.
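Both pooling variants can be sketched with one small function. The name `pool2d`, its parameters, and the 4x4 feature map below are hypothetical and only meant to illustrate the operation on a single depth slice.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a square 2D feature map."""
    n = len(feature_map)
    out_size = (n - size) // stride + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            window = [feature_map[row * stride + i][col * stride + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        output.append(out_row)
    return output


# Hypothetical 4x4 feature map.
feature_map = [[1, 3, 2, 1],
               [4, 6, 5, 0],
               [7, 2, 9, 8],
               [1, 0, 3, 4]]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[6, 5], [7, 9]]
print(avg_pooled)  # [[3.5, 2.0], [2.5, 6.0]]
```

The function contains no weights, which matches the note above: pooling has nothing to learn during training.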

Fully Connected Layer:

The objective of a fully connected layer is to take the results of the convolution/pooling process and use them to classify the image into a label. These layers act identically to the layers in a traditional deep neural network. The main difference is that their inputs have the shape and form created by the earlier stages of the CNN (the convolutional and pooling layers).


Fully Connected Layers

Flattening converts the data into a 1-dimensional array so it can be fed to the fully connected layer.


Flatten, FC input layer, FC output layer

The FC input layer takes the flattened vector as input and applies weights and an activation function. The FC output layer gives the final probabilities for each label. The difference between the two layers is the activation function: ReLU in the input layer and softmax in the output layer. (We’ll cover that in more depth in a separate post.)
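The flatten, FC input layer, and FC output layer steps can be sketched as below. Everything here is illustrative: the tiny 2x2x1 volume, the hand-picked weights, and the function names are assumptions of this example, and a trained network would learn its weights rather than have them written by hand.

```python
import math


def flatten(volume):
    """Turn a volume[row][col][channel] into a flat 1-dimensional list."""
    return [value for row in volume for cell in row for value in cell]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2x2x1 pooled output.
pooled = [[[1], [2]],
          [[3], [4]]]
flattened = flatten(pooled)  # [1, 2, 3, 4]

# FC input layer with two units and ReLU (hand-picked weights).
hidden_weights = [[0.1, 0.1, 0.1, 0.1],
                  [-0.1, -0.1, -0.1, -0.1]]
hidden = [max(0.0, h) for h in dense(flattened, hidden_weights, [0.0, 0.0])]

# FC output layer with two labels and softmax (hand-picked weights).
output_weights = [[1.0, 0.0],
                  [-1.0, 0.0]]
probs = softmax(dense(hidden, output_weights, [0.0, 0.0]))
print(flattened)
print(probs)
```

The printed probabilities sum to 1, and with these particular weights the first label receives the larger share.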

Summary:

  • A CNN primarily has two objectives: feature extraction and classification.
  • CNNs have three types of layers: convolutional, pooling, and fully connected. Successive convolutional layers learn filters of increasing complexity.
  • The first layers learn basic feature detection filters: edges, corners, etc.
  • The middle layers learn filters that detect parts of objects. For faces, they might learn to respond to eyes, noses, etc.
  • The last layers have higher representations: they learn to recognize full objects, in different shapes and positions.
Sample CNN architecture

To implement a sample CNN, you can follow the guided implementation here .

