Gradient Descent Extensions to Your Deep Learning Models

Category: IT Technology · Published: 5 years ago

Learn about the different available methods, and to select the one most appropriate to solve your problem.

Gradient Descent Extensions to Your Deep Learning Models — Source: Pixabay

Introduction

The objective of this article is to explore the different Gradient Descent extensions such as Momentum, Adagrad, RMSprop…

In previous articles, we studied three methods for training Deep Learning models with back-propagation:

Gradient Descent
Stochastic Gradient Descent
Mini-Batch Stochastic Gradient Descent

Of these, we keep mini-batch: it is faster than full Gradient Descent because it does not have to calculate gradients and errors for the entire dataset at every update, and it reduces the high variability that exists in Stochastic Gradient Descent.

There are improvements over these methods, such as Momentum. There are also more complex algorithms such as AdaGrad, RMSprop and Adam.

Let’s see them!

Momentum

Imagine being a kid again and having the great idea of putting on your skates, climbing up the steepest street and starting to go down it. You are a total beginner and this is only the second time you have worn skates.

I don’t know if any of you have ever really done this, but well, I have, so let me explain what happens:

When you start, your speed is low; you even seem to be in control and could stop at any time.
But the further down you go, the faster you move: this is momentum.
So the more road you go down, the more inertia you carry and the faster you go.
For those of you who are curious, the story ends with a fence at the bottom of the steep street. The rest you can imagine…

The Momentum technique is precisely this. As we go down the loss surface, calculating gradients and making updates, we accumulate a velocity from past gradients: updates that keep pointing in the same direction (the one that steadily reduces the loss) build up and gain importance, while updates that keep changing direction largely cancel out.

So, the result is to speed up the training of the network.

Also, thanks to the momentum, we may be able to avoid small potholes in the road, that is, shallow local minima (flying over them thanks to the speed).
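To make the idea concrete, here is a minimal sketch of the momentum update in plain Python. The toy loss, the helper names (grad, sgd_momentum) and the hyperparameter values (lr, mu) are assumptions chosen for illustration; they do not come from any particular library.

```python
def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [w[0], 10.0 * w[1]]


def sgd_momentum(w, lr=0.05, mu=0.9, steps=300):
    """Gradient descent with classical momentum (hypothetical helper)."""
    v = [0.0] * len(w)  # velocity: running memory of past updates
    for _ in range(steps):
        g = grad(w)
        # Keep a fraction mu of the old velocity and add the new gradient step.
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        # Move along the velocity instead of along the raw gradient.
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


w_final = sgd_momentum([5.0, 2.0])
print(w_final)  # both coordinates end up very close to the minimum at [0, 0]
```

Here mu plays the role of friction: with mu = 0 the velocity has no memory and the loop reduces to plain gradient descent.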

You can learn more about the mathematical foundation behind this technique in this great post: http://cs231n.github.io/neural-networks-3/#sgd

Nesterov Momentum

Going back to the earlier example: we are going down the road at full speed (because we have built up a lot of momentum) and suddenly we see the end of it. We would like to be able to brake, to slow down to avoid crashing. Well, this is precisely what Nesterov does.

Nesterov calculates the gradient, but instead of doing it at the current point, it does it at the point where we know our momentum is going to take us, and then applies a correction.

Figure by Author

Notice that using standard momentum, we calculate the gradient (small orange vector) and then take a big step in the direction of the gradient (large orange vector).

Using Nesterov, we would first make a big jump in the direction of our previous gradient (green vector), then measure the gradient there and make the appropriate correction (red vector).

In practice, it works a little better than momentum alone. It is like calculating the gradient of the weights in the future (because we evaluate it after adding the momentum we had already calculated).
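The look-ahead step can be sketched as follows. This is a minimal illustration in plain Python; the toy loss, the helper names (grad, nesterov_momentum) and the hyperparameter values are assumptions made for the example.

```python
def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [w[0], 10.0 * w[1]]


def nesterov_momentum(w, lr=0.05, mu=0.9, steps=300):
    """Gradient descent with Nesterov momentum (hypothetical helper)."""
    v = [0.0] * len(w)
    for _ in range(steps):
        # Jump first: the point where the current velocity would take us.
        w_ahead = [wi + mu * vi for wi, vi in zip(w, v)]
        # Measure the gradient at that look-ahead point, not at w.
        g = grad(w_ahead)
        # Correct the velocity with the look-ahead gradient, then move.
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


w_final = nesterov_momentum([5.0, 2.0])
print(w_final)  # both coordinates end up very close to the minimum at [0, 0]
```

The only difference from standard momentum is where the gradient is evaluated: at w_ahead instead of at w.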

You can learn more about the mathematical foundation behind this technique in this great post: http://cs231n.github.io/neural-networks-3/#sgd

Both Nesterov’s momentum and standard momentum are extensions of SGD.

The methods that we are going to see now are based on adaptive learning rates, allowing us to accelerate or slow down the speed with which we update the weights. For example, we could use a high speed at the beginning, and lower it as we approach the minimum.

Adaptive gradient (AdaGrad)

AdaGrad keeps a history of the calculated gradients (in particular, the sum of the squared gradients for each parameter) and uses it to normalize the “step” of the update.

The intuition behind it is that it identifies the parameters with very high gradients, whose weight updates would be very abrupt, and assigns them a lower learning rate to mitigate this abruptness.

At the same time, the parameters that have very low gradients will be assigned a higher learning rate.

In this way, we manage to accelerate the convergence of the algorithm.
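As a rough illustration, the AdaGrad update could look like this in plain Python. The toy loss, the helper names (grad, adagrad), the small constant eps that avoids division by zero, and the hyperparameter values are all assumptions made for this sketch.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [w[0], 10.0 * w[1]]


def adagrad(w, lr=0.5, eps=1e-8, steps=1000):
    """AdaGrad with one accumulator per parameter (hypothetical helper)."""
    cache = [0.0] * len(w)  # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        # The sum only ever grows, because squared gradients are never negative.
        cache = [ci + gi * gi for ci, gi in zip(cache, g)]
        # Parameters with a large gradient history take smaller steps.
        w = [wi - lr * gi / (math.sqrt(ci) + eps)
             for wi, gi, ci in zip(w, g, cache)]
    return w


w_final = adagrad([5.0, 2.0])
print(w_final)  # both coordinates end up very close to the minimum at [0, 0]
```

Note that each parameter gets its own effective learning rate, lr / sqrt(cache), even though lr itself is a single number.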

You can learn more about the theory behind this technique in its original paper here: http://jmlr.org/papers/v12/duchi11a.html

RMSprop

The problem with AdaGrad is that the sum of the squared gradients can only grow (it is a monotonically increasing function). Since the learning rate is divided by the square root of this sum, the effective learning rate keeps shrinking until it is practically zero, at which point learning stops.

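What RMSprop proposes is to keep that sum from growing without bound by using a decay_rate: instead of a plain sum, it maintains an exponentially decaying moving average of the squared gradients.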
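A minimal sketch of this idea in plain Python is shown below. Only decay_rate comes from the description above; the toy loss, the helper names (grad, rmsprop), the constant eps and the chosen values are assumptions for illustration.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [w[0], 10.0 * w[1]]


def rmsprop(w, lr=0.01, decay_rate=0.9, eps=1e-8, steps=2000):
    """RMSprop with one moving average per parameter (hypothetical helper)."""
    cache = [0.0] * len(w)  # moving average of squared gradients
    for _ in range(steps):
        g = grad(w)
        # Old squared gradients fade away instead of piling up forever.
        cache = [decay_rate * ci + (1.0 - decay_rate) * gi * gi
                 for ci, gi in zip(cache, g)]
        # Normalize each step by the recent gradient magnitude.
        w = [wi - lr * gi / (math.sqrt(ci) + eps)
             for wi, gi, ci in zip(w, g, cache)]
    return w


w_final = rmsprop([5.0, 2.0])
print(w_final)  # both coordinates end up close to the minimum at [0, 0]
```

Because the step is normalized by the recent gradient size, with a constant lr the weights in this toy setup end up hovering within a few multiples of lr around the minimum rather than landing exactly on it.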

RMSprop was never published as a paper; it was introduced in a set of lecture slides, which you can read here: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Adam

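Finally, Adam is one of the most popular modern algorithms. It improves on RMSprop by adding momentum to the update rule. It introduces 2 new parameters, beta1 and beta2, with recommended values of 0.9 and 0.999: beta1 controls the decay of the moving average of the gradients (the momentum term) and beta2 controls the decay of the moving average of the squared gradients.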
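Putting the two ideas together, an Adam update could be sketched like this in plain Python. beta1 and beta2 use the recommended values mentioned above; the toy loss, the helper names (grad, adam), the learning rate, the constant eps and the number of steps are assumptions for illustration. The bias-correction terms follow the original paper.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [w[0], 10.0 * w[1]]


def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: momentum plus RMSprop-style scaling (hypothetical helper)."""
    m = [0.0] * len(w)  # moving average of gradients (momentum part)
    v = [0.0] * len(w)  # moving average of squared gradients (RMSprop part)
    for t in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w


w_final = adam([5.0, 2.0])
print(w_final)  # both coordinates end up very close to the minimum at [0, 0]
```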

You can check out its paper here: https://arxiv.org/abs/1412.6980

But then, which one should we use?

As a rule of thumb, the recommendation is to start with Adam. If it does not work well, then you can try and tune the rest of the techniques. But most of the time, Adam works great.

You can check these resources to gain a better understanding of these techniques, how and when to apply them:

Final Words

As always, I hope you enjoyed the post!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence follow me on Medium, and stay tuned for my next posts!
