If Rectified Linear Units Are Linear, How Do They Add Nonlinearity?

Category: IT Technology · Published: 3 years ago

One may be inclined to point out that ReLUs cannot extrapolate; that is, a series of ReLUs fitted to resemble a sine wave on -4 < x < 4 will not be able to continue the sine wave for values of x outside those bounds. It’s important to remember, however, that the goal of a neural network is not to extrapolate but to generalize. Consider, for instance, a model fitted to predict house price from the number of bathrooms and the number of bedrooms. It doesn’t matter if the model struggles to carry the pattern to negative numbers of bathrooms or to more than five hundred bedrooms, because handling such inputs is not the model’s objective. (You can read more about generalization vs extrapolation here.)
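To make the sine-wave point concrete, here is a minimal sketch in plain Python. It does not train anything: the ReLU coefficients are set by hand so that the sum interpolates sin(x) at evenly spaced points, which stands in for what a fitted network might learn. The names `fit_relu_sum` and `predict`, and the choice of 32 units, are assumptions made for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def fit_relu_sum(f, lo, hi, n_units):
    """Hand-construct a sum of ReLUs that interpolates f at evenly spaced knots."""
    knots = [lo + (hi - lo) * i / n_units for i in range(n_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(n_units)]
    # Each ReLU switches on at its knot and changes the running slope by coeffs[i].
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    return f(lo), knots[:-1], coeffs

def predict(model, x):
    bias, knots, coeffs = model
    return bias + sum(c * relu(x - k) for k, c in zip(knots, coeffs))

model = fit_relu_sum(math.sin, -4.0, 4.0, 32)
for x in (-8.0, -2.0, 0.0, 3.0, 8.0):
    print(f"x = {x:5.1f}  sin(x) = {math.sin(x):7.4f}  relu sum = {predict(model, x):7.4f}")
```

Inside the fitted range the two columns agree closely. Outside it, the sum stays flat to the left of -4 and follows a straight line to the right of 4, so it cannot continue the wave.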

The strength of the ReLU function lies not in itself, but in an entire army of ReLUs. This is why using a few ReLUs in a neural network does not yield satisfactory results; instead, there must be an abundance of ReLU activations to allow the network to construct an entire map of points. In multi-dimensional space, rectified linear units combine to form complex polyhedra along the class boundaries.
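As a small illustration of how a handful of ReLUs can carve out a straight-edged region, the sketch below hand-builds a two-input classifier whose boundary is a diamond. The weights are chosen by hand rather than learned, and the function names `diamond_score` and `classify` are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def diamond_score(x, y):
    """Positive inside the diamond |x| + |y| < 1, negative outside it."""
    # |x| = relu(x) + relu(-x), so four hidden ReLUs are enough.
    hidden = [relu(x), relu(-x), relu(y), relu(-y)]
    return 1.0 - sum(hidden)

def classify(x, y):
    return diamond_score(x, y) > 0

for point in [(0.0, 0.0), (0.4, 0.4), (0.6, 0.6), (-2.0, 0.0)]:
    print(point, "inside" if classify(*point) else "outside")
```

Each ReLU contributes one flat piece, and the boundary is where those pieces meet. With more units and more input dimensions, the same mechanism yields the polyhedral boundaries described above.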

Here lies the reason why ReLU works so well: when there are enough of them, much like stacking hundreds of Legos, they can approximate any continuous function on a bounded range just as well as other activation functions like sigmoid or tanh, without the downsides. Smooth-curve functions have several issues that do not occur with ReLU. One is that computing the derivative, or the rate of change, which is the driving force behind gradient descent, is much cheaper with ReLU than with any smooth-curve function.
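The "enough of them" claim can be checked numerically. The sketch below, which assumes a hand-built interpolating sum rather than a trained network, measures the worst-case gap between sin(x) and a ReLU sum on -4 < x < 4 as the number of units grows. The helper names `relu_sum` and `max_error` are illustrative.

```python
import math

def relu_sum(f, lo, hi, n_units):
    """Return a function: a sum of n_units ReLUs interpolating f on [lo, hi]."""
    h = (hi - lo) / n_units
    knots = [lo + i * h for i in range(n_units)]
    slopes = [(f(k + h) - f(k)) / h for k in knots]
    # Coefficient of each ReLU is the change in slope at its knot.
    coeffs = [slopes[0]] + [b - a for a, b in zip(slopes, slopes[1:])]
    def g(x):
        return f(lo) + sum(c * max(0.0, x - k) for k, c in zip(knots, coeffs))
    return g

def max_error(f, g, lo, hi, samples=2001):
    """Largest absolute difference between f and g on an even grid."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - g(x)) for x in xs)

errors = {n: max_error(math.sin, relu_sum(math.sin, -4.0, 4.0, n), -4.0, 4.0)
          for n in (4, 16, 64)}
for n, err in errors.items():
    print(f"{n:3d} ReLUs  max error = {err:.5f}")
```

The error drops quickly as units are added, which is the Lego intuition in numbers: more, smaller straight pieces hug the curve more tightly.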

Another is that sigmoid and other curves suffer from the vanishing gradient problem: the derivative of the sigmoid function shrinks toward zero for larger absolute values of x. Because the distribution of inputs may shift far away from 0 early in training, the derivative can become so small that no useful information is backpropagated to update the weights. This is often a major problem in neural network training.
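A quick way to see the effect is to evaluate the sigmoid derivative, which equals sigmoid(x) * (1 - sigmoid(x)), at inputs further and further from 0. The snippet below is a minimal sketch; the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); it peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")
```

Even at its best the derivative is only 0.25, and by x = 10 it is below 0.0001, so a unit pushed that far from 0 passes almost nothing back during backpropagation.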

On the other hand, the derivative of the ReLU function is simple; it’s the slope of whatever line the input is on: 1 for x > 0 and 0 for x < 0. It reliably returns a useful gradient for positive inputs, and while the fact that ReLU(x) = 0 for x < 0 may sometimes lead to a ‘dead neuron problem’, ReLU has still been shown to be, in general, more powerful than not only curved functions (sigmoid, tanh) but also ReLU variants attempting to solve the dead neuron problem, like Leaky ReLU.
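The sketch below shows the ReLU derivative and what a dead neuron looks like for a single unit computing relu(w * x + b). The weight, bias, and inputs are made-up values chosen so that every pre-activation is negative, and the slope of 0.01 for Leaky ReLU is a common default assumed here, not something taken from the text.

```python
def relu_grad(z):
    # Convention assumed here: gradient 0 at z == 0.
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

inputs = [0.5, 1.0, 2.0]
w, b = -1.0, -0.1  # hypothetical values that keep w * x + b below zero

pre_activations = [w * x + b for x in inputs]

# Gradient of the summed output with respect to w is sum(act'(z) * x).
dead_grad = sum(relu_grad(z) * x for z, x in zip(pre_activations, inputs))
leaky_grad = sum(leaky_relu_grad(z) * x for z, x in zip(pre_activations, inputs))

print("pre-activations:", pre_activations)
print("ReLU gradient w.r.t. w:      ", dead_grad)
print("Leaky ReLU gradient w.r.t. w:", leaky_grad)
```

With plain ReLU the weight receives a gradient of exactly zero, so gradient descent cannot move it: the neuron is dead for these inputs. Leaky ReLU keeps a small nonzero gradient in the same situation.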

ReLU is designed to work in abundance; with heavy volume it approximates well, and with good approximation it performs just as well as any other activation function, without the downsides.

