Some Call it Genius, Others Call it Stupid: The Most Controversial Neural Network Ever Created
The Extreme Learning Machine
Some believe that the Extreme Learning Machine (ELM) is one of the most ingenious neural network inventions ever created, so much so that there’s even a conference dedicated exclusively to the study of ELM architectures. Proponents argue that ELMs can perform standard tasks with far shorter training times and with few training examples. On the other hand, besides the fact that ELMs are not widely used in the machine learning community, they have drawn plenty of criticism from deep learning experts, including Yann LeCun, who argue that they have received far more publicity and credibility than they deserve.
An ELM architecture is composed of two layers: the first is randomly initialized and fixed, whereas the second is trainable. Essentially, the network randomly projects the data into a new space and performs a multiple regression on the result (then passes it through an output activation function, of course). The random projection is a dimensionality reduction (or enlargement) method that multiplies the input by a random matrix. Although the idea may sound odd, drawing the weights randomly from well-chosen distributions can actually work very well (as we’ll see with an intuitive analogy later). The projection applies a random distortion of sorts, a kind of noise that is helpful if done correctly, and lets the remainder of the network adapt to it, opening up new learning opportunities.
In fact, it is because of this randomness that Extreme Learning Machines have been shown to be universal approximators, even with relatively few nodes in the hidden layer.
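To make the fixed random layer concrete, here is a minimal NumPy sketch of the projection step only. The function name `random_hidden_layer`, the `tanh` activation, and the choice of standard-normal weights are assumptions made for illustration; the article does not specify any of them.

```python
import numpy as np

def random_hidden_layer(X, n_hidden, rng):
    """Project X through a fixed random layer (these weights are never trained)."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))  # random input weights (assumed Gaussian)
    b = rng.standard_normal(n_hidden)                # random biases (assumed Gaussian)
    H = np.tanh(X @ W + b)                           # hidden activations (tanh is an assumption)
    return H, W, b

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))                      # 5 samples, 2 input features
H, W, b = random_hidden_layer(X, n_hidden=50, rng=rng)
print(H.shape)  # (5, 50): the 2-D input was enlarged to 50 dimensions
```

Choosing `n_hidden` smaller than the number of input features would turn the same step into a dimensionality reduction instead of an enlargement.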
The idea of random projections is not new, however: it was explored early in the development of neural networks, in the 1980s and 1990s, under that name. This is the reasoning behind one critique, namely that ELMs are nothing new, just old research packaged under a new name. Other architectures, like echo state networks and liquid state machines, also rely on fixed random connections and other sources of randomness.
Perhaps the largest difference between ELMs and other neural network architectures, however, is that ELMs don’t use backpropagation. Instead, since the trainable half of the network is simply a multiple regression, its parameters are fitted in roughly the same way as the coefficients of a regression. This represents a fundamental shift in how neural networks are thought to be trained.
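For readers who want the regression written out, the fit is usually stated as follows. The symbols are notation assumed here rather than taken from this article: X is the input matrix, W and b are the fixed random weights and biases, g is the hidden activation, T holds the targets, and β is the trainable output weight matrix.

```latex
H = g(XW + b), \qquad
\hat{\beta} = \arg\min_{\beta} \lVert H\beta - T \rVert_2^2 = H^{\dagger} T
```

Here H† denotes the Moore-Penrose pseudoinverse of H, so the whole "training" step is a single linear least-squares solve with no iteration.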
Almost every neural network developed since the vanilla artificial neural network has been optimized by iterative updating (or call it tuning, if you’d like), bouncing information signals forward and backward through the network. Because this method has been around for so long, it is tempting to assume that it has been tried and tested as the best one, but researchers acknowledge that standard backpropagation has many issues, such as being very slow to train or getting stuck in tempting local minima.
ELM, on the other hand, sets its trainable weights with a closed-form formula rather than an iterative loop: the output weights are the least-squares solution of a linear system, typically computed with a Moore-Penrose pseudoinverse. Without going too deeply into the math, one can think of the random layer as standing in for the more computationally expensive learned layer it replaces. If it helps, the wildly successful Dropout layer can loosely be seen as a sort of random projection too.
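As a rough illustration of what training without backpropagation looks like, the sketch below fits a tiny ELM to a toy regression problem in one step. Everything specific in it is an assumption for the example: the sine-curve data, the 50 hidden nodes, the `tanh` activation, the Gaussian random weights, and the `rcond` cutoff passed to the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative only): learn y = sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X)

# Fixed random first layer: drawn once, never updated.
n_hidden = 50
W = rng.standard_normal((1, n_hidden))
b = rng.standard_normal(n_hidden)

def hidden(X):
    """Random projection followed by a nonlinearity."""
    return np.tanh(X @ W + b)

# The only "training": one least-squares solve for the output weights.
H = hidden(X)
beta = np.linalg.pinv(H, rcond=1e-10) @ T   # shape (50, 1)

pred = hidden(X) @ beta
mse = float(np.mean((pred - T) ** 2))
print(f"training MSE: {mse:.2e}")
```

There is no learning rate, optimizer, or epoch count to choose; the main knob in this sketch is the number of hidden nodes.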
Because ELMs employ both randomness and a no-backpropagation algorithm, they are much faster to train than standard neural networks. Whether they perform better or not is another question.
One could argue that ELMs are more reflective of how humans learn than standard neural networks (although both are far from it): an ELM can solve simpler tasks very quickly with only a few examples, whereas iterative neural networks need to run through, at the very least, tens of thousands of samples to generalize and perform well. Humans may have their weaknesses in comparison to machines, but our vast superiority in learning-to-examples ratio (examples being the number of training examples we are exposed to) is what makes us truly intelligent.
The concept of the Extreme Learning Machine is very simple — so simple that some people may call it stupid. Yann LeCun, the great computer scientist and deep learning pioneer, declared that “connecting the first layer randomly is just about the stupidest thing you could do,” following this argument by listing more developed methods to non-linearly transform the dimensionality of vectors, such as kernel methods used in SVM, which are further strengthened via positioning with backpropagation.
In essence, LeCun says, the ELM is an SVM with a worse transformation kernel; the limited scope of problems that ELM is able to address would probably be better modelled with an SVM. The only rebuttal is computational efficiency: using a ‘random kernel’ instead of a specialized one is cheaper, as SVMs are notoriously expensive to train. Whether that is worth the decrease in performance ELM may bring is another discussion to be had.
Yet, like the ELM or not, random projections or filters in simple neural networks and other models have empirically been shown to perform shockingly well on a variety of standard (and now considered ‘simple’) training tasks, like MNIST. While these performances are not top-of-the-class, the fact that an architecture that has drawn so much scrutiny, and whose concept almost comes across as ridiculous, can edge its way up the leaderboard alongside state-of-the-art neural networks, with a much more lightweight architecture and a far smaller computational bill, is at the very least interesting.
Why would using fixed random connections work?
It’s the million-dollar question: evidently, something about the random connections in the ELM is working if it performs as well as, if not better than, a vanilla backprop neural network. While the mathematics is unintuitive, the author of the original Extreme Learning Machine paper, Guang-Bin Huang, told a parable to illustrate the concept (edited for language, conciseness, and drawing deep learning parallels):
You want to fill up a lake with rocks until you get a horizontal surface filled with stones instead of water, and you can see the bottom of the empty lake, which is a curve (function representing the data). The engineer carefully calculates the size of the lake, the sizes of stones to fill it, and a plethora of other small factors that play a role in optimizing the task. (Optimizing many parameters that go into fitting the function.)
The rural farmer, on the other hand, blows up the nearby mountain and begins throwing or pushing the rocks that fall off into the lake. When the rural farmer picks up a stone (hidden layer node), he doesn’t need to know the size of the lake or the size of the stone; he just throws the stones in at random and spreads them out. If rocks begin piling above the surface in one area, the farmer takes a hammer and smashes them down (the beta parameter, a regularization of sorts), levelling the surface.
While the engineer is still calculating heights and volumes of the rocks and the shape of the lake, the farmer has already filled up the lake. To the farmer, it doesn’t matter how many rocks he threw: he got the job done faster.
Although this analogy does not map perfectly onto every scenario, it is an intuitive explanation of the nature of the ELM and the role randomness plays in the model. The essence of ELM is that being naïve isn’t always a bad thing: simpler solutions may be able to address less complex problems better.
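The parable can be loosely acted out in code: throw in more random "stones" (hidden nodes), refit the output weights each time, and watch the training error shrink. The sketch below is only an illustration under assumed settings (a sine-curve "lake bottom", `tanh` nodes, Gaussian random weights, node counts of 1, 3, 10, and 30), and the mapping from the story to the code is an interpretation, not something taken from Huang.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "lake bottom": a toy target curve (illustrative only).
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X)

# Draw all the random "stones" once; later runs reuse the earlier ones.
max_hidden = 30
W = rng.standard_normal((1, max_hidden))
b = rng.standard_normal(max_hidden)
H_full = np.tanh(X @ W + b)

node_counts = (1, 3, 10, 30)
errors = []
for k in node_counts:
    H = H_full[:, :k]                              # first k random nodes
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)   # "levelling": refit output weights
    errors.append(float(np.mean((H @ beta - T) ** 2)))

for k, e in zip(node_counts, errors):
    print(f"{k:>2} nodes: training MSE = {e:.2e}")
```

Because each larger set of nodes contains the smaller one, the least-squares training error cannot get worse as nodes are added, up to floating-point noise. This says nothing about error on unseen data, which is where a regularization term would come in.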
- Extreme Learning Machines use a fixed, random first layer and a trainable second layer. This is essentially a random projection, followed by a multiple regression.
- Proponents say that ELMs are able to learn with very few examples very quickly in simpler scenarios (like MNIST), with the advantage of being very easy to program and without the burden of needing to choose parameters like the architecture, optimizers, and losses. On the other hand, opponents argue that an SVM would be better in those scenarios, that ELM is unsuitable for more complex problems, and that it is simply a rebranding of a very old idea.
- ELMs generally do not perform well on complex tasks, but the fact that they can perform well on simpler tasks is a good reason to further explore the world of lightweight architectures, non-backpropagation model fitting, and random projections. At the very least, the Extreme Learning Machine (or whatever name you would like to brand the idea under) is an interesting concept.
“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.”
— Eliezer Yudkowsky
All images created by author.