Gradient Descent Extensions to Your Deep Learning Models

Learn about the different available methods, and to select the one most appropriate to solve your problem.

Source: Pixabay

Introduction

The objective of this article is to explore the different Gradient Descent extensions such as Momentum, Adagrad, RMSprop…

In previous articles, we have studied three methods to train Deep Learning models using the gradients obtained through back-propagation:

  • Gradient Descent
  • Stochastic Gradient Descent
  • Mini-Batch Stochastic Gradient Descent

Of these, we keep the mini-batch version because it is faster, as it does not have to calculate gradients and errors for the entire dataset, and because it reduces the high variability that exists in Stochastic Gradient Descent.

However, there are improvements over these methods, such as Momentum. Besides, there are other more complex algorithms such as Adam, RMSprop or Adagrad.

Let’s see them!

Momentum

Imagine being a kid again and having the great idea of putting on your skates, climbing up the steepest street and starting to go down it. You are a total beginner and this is the second time you have worn skates.

I don’t know if any of you have ever really done this, but well, I have, so let me explain what happens:

  • When you have just started, the speed is low, you even seem to be in control and you could stop at any time.
  • But the lower you go, the faster you move: this is called momentum. The more road you go down, the more inertia you carry and the faster you go.
  • Well, for those of you who are curious, the end of the story is that at the end of the steep street there is a fence. The rest you can imagine…

Well, the Momentum technique is precisely this. As we go down our loss curve, calculating the gradients and making the updates, we accumulate the past updates: the directions that consistently reduce the loss gain more and more importance, while the directions that keep changing sign largely cancel out.

Figure by the Author

So, the result is to speed up the training of the network.

Also, thanks to the momentum, we may be able to avoid small potholes or holes in the road (flying over them thanks to the speed).

You can learn more about the mathematical foundation behind this technique in this great post: http://cs231n.github.io/neural-networks-3/#sgd
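To make the idea concrete, here is a minimal sketch of the momentum update in plain Python. It is an illustration rather than a reference implementation: the function name sgd_momentum, the toy loss and the values of lr and mu (the momentum coefficient) are assumptions chosen for this example and do not come from any library.

```python
def sgd_momentum(grad_fn, w, lr=0.1, mu=0.9, steps=100):
    """Gradient descent with classical momentum (illustrative sketch).

    grad_fn: hypothetical callable returning the gradient (list of floats) at w.
    w: initial parameters (list of floats).
    """
    w = list(w)
    v = [0.0] * len(w)  # velocity: a decaying sum of the past update steps
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            # Keep a fraction mu of the old velocity and add the new gradient step.
            v[i] = mu * v[i] - lr * g[i]
            w[i] += v[i]
    return w


# Toy loss (assumption): f(w) = 0.5 * (w0^2 + 10 * w1^2), with its minimum at (0, 0).
def toy_grad(w):
    return [w[0], 10.0 * w[1]]


w_final = sgd_momentum(toy_grad, [5.0, 5.0], lr=0.05, mu=0.9, steps=200)
print(w_final)  # both coordinates end up close to 0
```

Setting mu to 0 turns this back into plain gradient descent, which is a quick way to compare both behaviours on the same toy loss.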

Nesterov Momentum

Going back to the previous example: we are going down the road at full speed (because we have built up a lot of momentum) and suddenly we see the end of it. We would like to be able to brake, to slow down to avoid crashing. Well, this is precisely what Nesterov does.

Nesterov calculates the gradient, but instead of doing it at the current point, it does it at the point where we know our momentum is going to take us, and then applies a correction.

Figure by Author

Notice that using standard momentum, we calculate the gradient (small orange vector) and then take a big step in the direction of the gradient (large orange vector).

Using Nesterov, we would first make a big jump in the direction of our previously accumulated gradient (green vector), then measure the gradient there and make the appropriate correction (red vector).

In practice, it works a little better than momentum alone. It is like calculating the gradient at the future position of the weights (because we have already added the momentum we had calculated).

You can learn more about the mathematical foundation behind this technique in this great post: http://cs231n.github.io/neural-networks-3/#sgd
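The look-ahead idea can be sketched in a few lines of plain Python. This is only one possible way to write it, under the same kind of assumptions as any toy example: the name sgd_nesterov, the toy loss and the hyperparameter values are made up for illustration. The only difference from standard momentum is the point at which the gradient is measured.

```python
def sgd_nesterov(grad_fn, w, lr=0.1, mu=0.9, steps=100):
    """Gradient descent with Nesterov momentum, look-ahead form (illustrative sketch)."""
    w = list(w)
    v = [0.0] * len(w)  # velocity accumulated from the past steps
    for _ in range(steps):
        # Look ahead: where would the current velocity alone take us?
        w_ahead = [w[i] + mu * v[i] for i in range(len(w))]
        # Measure the gradient at the look-ahead point, not at the current point.
        g = grad_fn(w_ahead)
        for i in range(len(w)):
            v[i] = mu * v[i] - lr * g[i]  # momentum jump plus the correction
            w[i] += v[i]
    return w


# Toy loss (assumption): f(w) = 0.5 * (w0^2 + 10 * w1^2), with its minimum at (0, 0).
def bowl_grad(w):
    return [w[0], 10.0 * w[1]]


w_nesterov = sgd_nesterov(bowl_grad, [5.0, 5.0], lr=0.05, mu=0.9, steps=200)
print(w_nesterov)  # both coordinates end up close to 0
```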

Both Nesterov’s momentum and standard momentum are extensions of SGD.

The methods that we are going to see now are based on adaptive learning rates, allowing us to accelerate or slow down the speed with which we update the weights. For example, we could use a high speed at the beginning, and lower it as we approach the minimum.

Adaptive gradient (AdaGrad)

It keeps a history of the calculated gradients (in particular, of the sum of the squared gradients) and normalizes the “step” of the update.

The intuition behind it is that it identifies the parameters with very high gradients, whose weight updates would be very abrupt, and assigns them a lower learning rate to mitigate this abruptness.

At the same time, the parameters that have a very low gradient will be assigned a high learning rate.

In this way, we manage to accelerate the convergence of the algorithm.
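A minimal sketch of this per-parameter normalization, in plain Python, could look as follows. The function name adagrad, the toy loss and the values of lr and eps (a small constant to avoid dividing by zero) are assumptions for illustration. Dividing by the square root of the accumulated sum follows the original AdaGrad paper.

```python
import math


def adagrad(grad_fn, w, lr=0.5, eps=1e-8, steps=100):
    """AdaGrad with one accumulator per parameter (illustrative sketch)."""
    w = list(w)
    cache = [0.0] * len(w)  # running sum of squared gradients, per parameter
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            cache[i] += g[i] ** 2
            # Effective learning rate for parameter i: lr / sqrt(cache[i]).
            w[i] -= lr * g[i] / (math.sqrt(cache[i]) + eps)
    return w, cache


# Toy loss (assumption): f(w) = 0.5 * (w0^2 + 10 * w1^2).
# The gradient of w1 is 10 times larger than the gradient of w0.
def steep_grad(w):
    return [w[0], 10.0 * w[1]]


w_one, cache_one = adagrad(steep_grad, [5.0, 5.0], lr=0.5, steps=1)
# After the first step the caches are 25 and 2500, so the effective learning
# rates are 0.5 / 5 = 0.1 and 0.5 / 50 = 0.01: the steeper parameter gets the
# smaller rate, and both parameters move by about lr (from 5.0 to about 4.5).
print(w_one, cache_one)

w_many, cache_many = adagrad(steep_grad, [5.0, 5.0], lr=0.5, steps=100)
print(w_many, cache_many)  # the caches only ever grow
```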

You can learn more about the theory behind this technique in its original paper here: http://jmlr.org/papers/v12/duchi11a.html

RMSprop

The problem with AdaGrad is that the sum of the squared gradients is a monotonically increasing function. Since the learning rate is divided by a quantity that never stops growing, the effective learning rate keeps shrinking until it is practically zero, at which point learning stops.

What RMSprop proposes is to decrease that sum of the squared gradients using a decay_rate.
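The effect of the decay can be sketched in plain Python as follows. As before, this is a hedged illustration: the function name rmsprop, the constant-gradient test function and the values of lr and eps are assumptions, while decay_rate is the parameter named above. Feeding a constant gradient makes the difference from a plain running sum easy to see.

```python
import math


def rmsprop(grad_fn, w, lr=0.01, decay_rate=0.9, eps=1e-8, steps=100):
    """RMSprop: AdaGrad with a decaying average instead of a sum (illustrative sketch)."""
    w = list(w)
    cache = [0.0] * len(w)  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            # Old squared gradients fade away instead of piling up forever.
            cache[i] = decay_rate * cache[i] + (1 - decay_rate) * g[i] ** 2
            w[i] -= lr * g[i] / (math.sqrt(cache[i]) + eps)
    return w, cache


# Hypothetical gradient function that always returns 1.0, whatever the weights are.
def constant_grad(w):
    return [1.0 for _ in w]


w_rms, cache_rms = rmsprop(constant_grad, [0.0], lr=0.01, decay_rate=0.9, steps=100)

# For comparison, the plain sum of squared gradients (AdaGrad style) over the same 100 steps.
adagrad_style_sum = sum(g ** 2 for g in [1.0] * 100)

print(cache_rms[0])       # stays close to 1.0, the squared gradient itself
print(adagrad_style_sum)  # 100.0, and it would keep growing with more steps
```

With a decay_rate closer to 1 the average remembers more of the past and behaves more like the AdaGrad sum; with a smaller value it reacts faster to recent gradients.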

RMSprop has not been published as a paper, but you can read more about it in these lecture slides: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Adam

Finally, Adam is one of the most modern algorithms, which improves RMSprop by adding momentum to the update rule. It introduces two new parameters, beta1 and beta2, with recommended values of 0.9 and 0.999.

You can check out its paper here: https://arxiv.org/abs/1412.6980
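Here is a minimal sketch of how the two ideas combine, written in plain Python. The function name adam, the toy loss and the values of lr and eps are assumptions for illustration; beta1 and beta2 use the recommended values mentioned above. The bias-correction step is not discussed in this article, but it is part of the algorithm as described in the Adam paper.

```python
import math


def adam(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam: momentum plus RMSprop-style scaling (illustrative sketch)."""
    w = list(w)
    m = [0.0] * len(w)  # decaying average of gradients (the momentum part)
    v = [0.0] * len(w)  # decaying average of squared gradients (the RMSprop part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction: m and v start at zero, so early values are too small.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss (assumption): f(w) = 0.5 * (w0^2 + 10 * w1^2), with its minimum at (0, 0).
def valley_grad(w):
    return [w[0], 10.0 * w[1]]


w_step = adam(valley_grad, [5.0, 5.0], lr=0.1, steps=1)
print(w_step)  # on the first step every parameter moves by about lr

w_adam = adam(valley_grad, [5.0, 5.0], lr=0.1, steps=200)
print(w_adam)  # both coordinates move towards 0
```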

But then, which one should we use?

Source: original ADAM paper

As a rule of thumb, the recommendation is to start with Adam. If it does not work well, then you can try and tune the rest of the techniques. But most of the time, Adam works great.

You can check these resources to gain a better understanding of these techniques, how and when to apply them:

Final Words

As always, I hope you enjoyed the post!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence, follow me on Medium, and stay tuned for my next posts!

