My Deep Learning Model Says: “sorry, I don’t know the answer”. That’s Absolutely OK.



Motivation

Although deep learning works, it is most of the time unclear why it works. This makes it tricky to deploy artificial intelligence in high-risk areas such as aviation, the judiciary, and medicine.

A neural network identifies that a cell biopsy is cancerous, but it does not tell us why.

Typically, a classifier model is forced to decide between two possible outcomes even when it has no clue; it has effectively just flipped a coin. In real life, a model for medical diagnosis should care not only about accuracy but also about how certain its prediction is. If the uncertainty is too high, the doctor should take this into account in the decision process.

A deep learning model should be able to say: “sorry, I don’t know”.

A model for self-driving cars that has learned from an insufficiently diverse training set is another interesting example. If the car is unsure whether there is a pedestrian on the road, we would expect it to let the driver take charge.

Networks with greater generalization are less interpretable. Interpretable networks don’t generalize well. (source)

Some models may not require explanations because they are used in low-risk applications, such as a product recommender system. Nevertheless, integrating critical models into our daily lives requires interpretability to increase the social acceptance of AI. This is because people like to attribute beliefs, desires, and intentions to things (source).

Understanding and explaining what a neural network doesn’t know is crucial for end-users. Practitioners also seek better interpretability to build more robust models that are resistant to adversarial attacks.


Image by Goodfellow et al., ICLR 2015, Explaining and Harnessing Adversarial Examples. Adding a little noise to a photo of a panda causes it to be misclassified as a gibbon.

In the following sections, we take a closer look at the concept of uncertainty. We also introduce simple techniques for assessing uncertainty in deep learning models.

Types of uncertainty

There are two major types of uncertainty in deep learning: epistemic uncertainty and aleatoric uncertainty. Neither term rolls off the tongue easily.

Epistemic uncertainty describes what the model does not know because the training data was inadequate. It is due to limited data and knowledge: given enough training samples, epistemic uncertainty decreases. It arises in regions where there are few samples for training.

Aleatoric uncertainty is the uncertainty arising from the natural stochasticity of observations. It cannot be reduced even when more data is provided. When it comes to measurement errors, we call it homoscedastic uncertainty because it is constant for all samples. Input-data-dependent uncertainty is known as heteroscedastic uncertainty.

The illustration below represents a real linear process (y = x) that was sampled around x = -2.5 and x = 2.5.


An exhibit of the different kinds of uncertainty in a linear regression context (Image by Michel Kana).

A sensor malfunction introduced noise in the left cloud. These noisy measurements of the underlying process lead to high aleatoric uncertainty in the left cloud. This uncertainty cannot be reduced by additional measurements, because the sensor keeps producing errors around x = -2.5 by design.

High epistemic uncertainty arises in regions where there are few observations for training, because too many plausible model parameters can explain the underlying ground-truth phenomenon. This is the case to the left and to the right of our two clouds: there, we are not sure which model parameters describe the data best. Given more data in those regions, uncertainty would decrease. In high-risk applications, it is important to identify such regions.

How to assess uncertainty using Dropout

Bayesian statistics allow us to derive conclusions based on both data and our prior knowledge about the underlying phenomenon. A key distinction is that parameters are treated as distributions instead of fixed weights.

If instead of learning the model’s parameters, we could learn a distribution over them, we would be able to estimate uncertainty over the weights.

How can we learn the weights’ distribution? Deep ensembling is a powerful technique in which a large number of models, or multiple copies of one model, are trained on their respective datasets, and their resulting predictions collectively build a predictive distribution.

Because ensembling can require plentiful computing resources, an alternative approach was suggested: dropout as a Bayesian approximation of a model ensemble. This technique was introduced by Yarin Gal and Zoubin Ghahramani in their 2016 paper.

Dropout is a widely used regularization practice in deep learning to avoid overfitting. It consists of randomly sampling network nodes and dropping them out during training: neurons are zeroed out at random according to a Bernoulli distribution.
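As a toy illustration (not the article's original code), here is how such a Bernoulli mask acts on a layer's activations; the rescaling by the keep probability is the common "inverted dropout" convention, which keeps the expected activation unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(1, 8))     # toy activations of one layer
keep_prob = 0.95                          # i.e., a 5% dropout rate

# Bernoulli mask: each neuron survives independently with probability keep_prob
mask = rng.binomial(1, keep_prob, size=activations.shape)

# zero out the dropped neurons and rescale the survivors
dropped = activations * mask / keep_prob
```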

In general, there seems to be a strong link between regularization and prior distributions in Bayesian models. Dropout is not the only example: the frequently used L2 regularization is essentially equivalent to a Gaussian prior on the weights.

In their paper, Gal and Ghahramani showed that a neural network with dropout applied before every weight layer is mathematically equivalent to a Bayesian approximation of a Gaussian process.


Image by Yu Ri Tan on yuritan.nl — Dropout changes the model architecture at each forward pass, allowing Bayesian approximation. (Image reproduced with the permission of Yu Ri Tan.)

With dropout, each subset of nodes that is not dropped out defines a new network. The training process can be thought of as training 2^m different models simultaneously, where m is the number of nodes in the network. For each batch, a randomly sampled set of these models is trained.

The key idea is to apply dropout at both training and testing time. At test time, the paper suggests repeating the prediction a few hundred times with random dropout. The average of all predictions is the estimate, and the variance of the predictions gives the ensemble's uncertainty.
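A minimal sketch of this procedure, assuming a tf.keras model that contains Dropout layers (calling the model with training=True keeps dropout active at inference time):

```python
import numpy as np

def mc_dropout_predict(model, x, n_iter=300):
    """Monte Carlo dropout: run the model n_iter times with dropout active."""
    # each call samples a different dropout mask, i.e., a different sub-network
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_iter)])
    # mean across runs is the estimate; variance approximates the uncertainty
    return preds.mean(axis=0), preds.var(axis=0)
```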

Predicting Epistemic Uncertainty

We will assess epistemic uncertainty on a regression problem, using data generated by adding normally distributed noise to the function y = x as follows (a generation sketch follows the list):

  • 100 data points are generated in the left cloud, between x = -3 and x = -2;
  • 100 data points are generated in the right cloud, between x = 2 and x = 3;
  • noise is added to the left cloud with 10 times higher variance than in the right cloud.
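A sketch of this data generation; the base noise level and random seed are assumed, as the article does not state them:

```python
import numpy as np

rng = np.random.default_rng(42)                  # assumed seed

x_left = rng.uniform(-3, -2, 100)                # left cloud
x_right = rng.uniform(2, 3, 100)                 # right cloud
x = np.concatenate([x_left, x_right]).reshape(-1, 1)

base_std = 0.1                                   # assumed base noise level
noise = np.concatenate([
    rng.normal(0, base_std * np.sqrt(10), 100),  # 10x higher variance on the left
    rng.normal(0, base_std, 100),
]).reshape(-1, 1)

y = x + noise                                    # underlying process: y = x
```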


Below we design two simple neural networks: one without dropout layers, and a second one with a dropout layer between the hidden layers. The dropout layer randomly disables 5% of neurons during each training and inference batch. We also include L2 regularizers, which apply penalties to layer parameters during optimization.

Network without dropout layers

Network with dropout layers
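The article shows these two architectures as images; a sketch of them in tf.keras follows, with assumed layer sizes (the original does not state them):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(dropout_rate=0.0):
    # two hidden layers with L2 penalties; sizes (64, 64) are assumed
    inputs = keras.Input(shape=(1,))
    h = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(inputs)
    if dropout_rate > 0:
        h = layers.Dropout(dropout_rate)(h)  # between the hidden layers
    h = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(h)
    outputs = layers.Dense(1)(h)
    return keras.Model(inputs, outputs)

model_plain = build_model()
model_dropout = build_model(dropout_rate=0.05)   # drops 5% of neurons
```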

The rmsprop optimizer is used to train on batches of 10 points by minimizing the mean squared error. The training performance is displayed below. Convergence is very fast for both models. The model with dropout exhibits a slightly higher loss with more stochastic behavior, because random regions of the network are disabled during training, causing the optimizer to jump across local minima of the loss function.
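A minimal compile-and-fit sketch matching that description (the epoch count is assumed):

```python
model_plain.compile(optimizer="rmsprop", loss="mse")
model_dropout.compile(optimizer="rmsprop", loss="mse")

# batches of 10 points, as in the article; 100 epochs is an assumption
model_plain.fit(x, y, batch_size=10, epochs=100, verbose=0)
model_dropout.fit(x, y, batch_size=10, epochs=100, verbose=0)
```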

Training loss curves for both models.

Below, we show how the models perform on test data. The model without dropout predicts a straight line with a perfect R² score. Including dropout causes a nonlinear prediction line with an R² score of 0.79. Although the dropout model overfits less, it has higher bias and lower accuracy; it highlights uncertainty in its predictions in the regions without training samples. The prediction line has higher variance in those regions, which can be used to compute epistemic uncertainty.


The model with dropout exhibits predictions with high variance in regions without training samples. This property is used to approximate epistemic uncertainty.

Below, we evaluate both models (with and without dropout) on a test dataset, applying the dropout layers at evaluation time a few hundred times. This approximates sampling from a Gaussian process: for each scalar input from the test data, we obtain a range of output values. This allows us to compute the standard deviation of the posterior distribution and display it as a measure of epistemic uncertainty.
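A sketch of this evaluation for the dropout model; the test range and the number of stochastic forward passes are assumptions, and the shaded band visualizes the epistemic uncertainty:

```python
import numpy as np
import matplotlib.pyplot as plt

x_test = np.linspace(-5, 5, 200).reshape(-1, 1)   # assumed test range

# a few hundred stochastic forward passes with dropout kept active
preds = np.stack([model_dropout(x_test, training=True).numpy().ravel()
                  for _ in range(300)])
mean, std = preds.mean(axis=0), preds.std(axis=0)

plt.plot(x_test.ravel(), mean, label="MC-dropout mean")
plt.fill_between(x_test.ravel(), mean - 2 * std, mean + 2 * std,
                 alpha=0.3, label="epistemic uncertainty (±2 std)")
plt.legend()
plt.show()
```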


The model without dropout predicts fixed values with 100% certainty, even in regions without training samples.


The model with dropout estimates high epistemic uncertainty in regions without training samples.

As expected, predictions for x < -3 and x > 3 have high epistemic uncertainty, as no training data is available at these points.

Dropout allows the model to say: “all my predictions for x < -3 and x > 3 are just my best guess.”


Polynomial Regression

In this section, we investigate how to assess epistemic uncertainty using dropout for more complex tasks, such as polynomial regression.

For this purpose, we generate a synthetic training dataset by randomly sampling a sinusoidal function and adding noise of different amplitudes, as sketched below.
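A sketch of such a dataset; the article only states "sinusoidal with noise of different amplitudes", so the exact function, ranges, and noise levels here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

x_sin = np.sort(rng.uniform(-3, 3, 200))
amplitude = np.where(x_sin < 0, 0.3, 0.1)        # assumed: larger noise on the left
y_sin = np.sin(x_sin) + rng.normal(0.0, amplitude)
```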

The results below suggest that including dropout provides a way to assess epistemic uncertainty in regions where there is no data, even for nonlinear data. Although dropout affects model performance, it clearly shows that predictions are less certain in data regions where there were not enough training samples.


The model without dropout overfits the training samples and shows overconfidence when predicting in regions without training data.


The model with dropout has a high bias, but is less confident in regions without training data. Epistemic uncertainty is higher where training samples are missing.

Predicting Aleatoric Uncertainty

While epistemic uncertainty is a property of the model, aleatoric uncertainty is a property of the data. Aleatoric uncertainty captures our uncertainty concerning information that our data cannot explain.

When aleatoric uncertainty is constant and does not depend on the input data, it is called homoscedastic uncertainty; otherwise, the term heteroscedastic uncertainty is used.

Heteroscedastic uncertainty depends on the input data and therefore can be predicted as a model output. Homoscedastic uncertainty can be estimated as a task-dependent model parameter.

Learning heteroscedastic uncertainty is done by replacing the mean-squared-error loss function with the following (source):

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{\lVert y_i - \hat{y}_i \rVert^2}{2\sigma(x_i)^2} + \frac{1}{2}\log \sigma(x_i)^2$$

The model predicts both a mean ŷ and a variance σ². If the residual is very large, the model will tend to predict a large variance; the log term prevents the variance from growing infinitely large. An implementation of this aleatoric loss function in Python is provided below.
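The original code listing is not preserved here; the following is a sketch of this loss for tf.keras. It uses the common reparameterization of predicting log σ² instead of σ² for numerical stability, which is mathematically equivalent to the formula above:

```python
import tensorflow as tf

def aleatoric_loss(y_true, y_pred):
    # the network emits two outputs per sample:
    # y_pred[:, 0:1] -> predicted mean, y_pred[:, 1:2] -> predicted log variance
    mean = y_pred[:, 0:1]
    log_var = y_pred[:, 1:2]
    # exp(-log_var) down-weights the squared residual where the predicted
    # variance is large; 0.5 * log_var keeps the variance from growing unboundedly
    return tf.reduce_mean(
        0.5 * tf.exp(-log_var) * tf.square(y_true - mean) + 0.5 * log_var
    )
```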

The aleatoric loss can be used to train a neural network. Below, we illustrate an architecture similar to the one used for epistemic uncertainty in the previous sections, with two differences:

  • there is no dropout layer between the hidden layers;
  • the output is a 2D tensor instead of a 1D tensor, which allows the network to learn not only the response ŷ but also the variance σ² (see the sketch after this list).
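A minimal sketch of such a model, reusing the assumed layer sizes from the earlier sketches and the aleatoric_loss defined above (here the second output is interpreted as the log variance):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(1,))
h = layers.Dense(64, activation="relu")(inputs)   # assumed layer sizes
h = layers.Dense(64, activation="relu")(h)        # no dropout between hidden layers
outputs = layers.Dense(2)(h)                      # [mean, log variance]

aleatoric_model = keras.Model(inputs, outputs)
aleatoric_model.compile(optimizer="rmsprop", loss=aleatoric_loss)

# trained on the same two-cloud data (x, y) generated earlier
aleatoric_model.fit(x, y, batch_size=10, epochs=100, verbose=0)
```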


The learned loss attenuation forces the network to find weights and variances that minimize the loss during training.

Inference for aleatoric uncertainty is done without dropout. The result below confirms our expectation: the aleatoric uncertainty is higher for data on the left than on the right. The left region has noisy data due to a sensor error around x = -2.5. Adding more samples would not fix the problem, since noise would still be present in that region. By including aleatoric uncertainty in the loss function, the model predicts with less confidence for test data falling in the regions where the training samples were noisy.


The model trained with the aleatoric loss detects regions with noisy training data. This helps it predict with higher aleatoric uncertainty in those regions.

Measuring aleatoric uncertainty can become crucial in computer vision. Such uncertainty in images can be attributed to occlusions when cameras can’t see through objects. Aleatoric uncertainty can also be caused by over-exposed regions of images or the lack of some visual features.

Epistemic and aleatoric uncertainty can be summed to obtain the total predictive uncertainty. Including the total level of uncertainty in the predictions of a self-driving car can be very useful.
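Assuming the two sources are independent, their variances add; a short sketch reusing the models and test points defined in the earlier sketches:

```python
import numpy as np

# epistemic variance from the MC-dropout loop over the regression model
mc_preds = np.stack([model_dropout(x_test, training=True).numpy().ravel()
                     for _ in range(300)])
epistemic_var = mc_preds.var(axis=0)

# aleatoric variance from the second (log variance) output of the aleatoric model
aleatoric_var = np.exp(aleatoric_model(x_test).numpy()[:, 1])

total_std = np.sqrt(epistemic_var + aleatoric_var)  # total predictive uncertainty
```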


Image by Alex Kendall, University of Cambridge, on arXiv — Aleatoric and epistemic uncertainty for semantic segmentation in computer vision. Aleatoric uncertainty (d) captures object boundaries where labels are noisy due to occlusion or distance. Epistemic uncertainty (e) highlights regions where the model is unfamiliar with image features, such as an interrupted footpath.

Conclusion

In this article, we demonstrated how using dropout at inference time amounts to a Bayesian approximation for assessing uncertainty in deep learning predictions.

Knowing how confident a model is about its predictions is important in a business context. Uber has been using this technique to assess uncertainty in time-series predictions.

Properly including uncertainty in machine learning can also help to debug models and make them more robust against adversarial attacks. The new TensorFlow Probability library offers probabilistic modeling as an add-on for deep learning models.

You can read further in my article about responsible data science and see what can go wrong when we trust our machine learning models a little too much. This comprehensive introduction to deep learning and practical guide to Bayesian inference can help deepen and challenge classical approaches to deep learning.

Thanks to Anne Bonner from Towards Data Science for her editorial notes.

Stay safe in uncertain times.

