Understanding Regularization in Machine Learning



Optimizing predictive models by preventing overfitting

Jun 10 · 11 min read


Photo by Jackson Jost on Unsplash

When training machine learning models, one major task is to evaluate whether the model is overfitting the data. Overfitting generally occurs when a model attempts to fit every data point, capturing noise in the process and producing an inaccurate model.

The performance of a machine learning model can be evaluated through a cost function. Generally, the cost function is the sum of the squares of the differences between the actual and predicted values.

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Here "y" represents the actual value while "ŷ" represents the predicted value.

This is also called the "sum of squared residuals" or "sum of squared errors". During training, a predictive model attempts to fit the data in a manner that minimizes this cost function.
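For instance, a minimal sketch of this cost in NumPy (the values below are made up purely for illustration) could look like this:

```python
import numpy as np

# Invented actual values and model predictions
y_actual = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.6])

# Sum of squared residuals: the quantity the model tries to minimize during training
sse = np.sum((y_actual - y_pred) ** 2)
print(sse)  # ~0.74
```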

A model begins to overfit when it passes through all the data points. In such instances, although the value of the cost function is zero, the model has absorbed the noise in the dataset and no longer represents the underlying function. Under such circumstances, the error calculated on the training data is small, but the error on the test data remains large.


Essentially, a model overfits the data by employing highly complex curves: polynomials with many terms (large degrees of freedom), each term weighted by its own coefficient.

$$\hat{y} = \beta_0 + \beta_1 x$$

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_{15} x^{15}$$

We can clearly observe the growing complexity of the curve from these equations: each additional term gives the curve more freedom to bend.

[Graph: training-set and test-set error plotted against the model's degrees of freedom]

One can observe from the above graph that for higher degrees of freedom the test set error is large when compared to the train set error.
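A rough sketch of this effect, fitting polynomials of increasing degree to synthetic noisy data (the data, degrees and seed here are arbitrary choices, not from the example above):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying function (synthetic data)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 4, 15):
    p = Polynomial.fit(x_train, y_train, deg=degree)  # fit a polynomial of this degree
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

With settings like these, the highest-degree fit typically achieves the lowest training error but a noticeably larger test error than the moderate-degree fit.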

Regularization is a technique by which machine learning algorithms can be prevented from overfitting a dataset. It achieves this by introducing a penalty term into the cost function that assigns a higher penalty to complex curves.

There are essentially two types of regularization techniques:

  1. L1 Regularization or LASSO regression
  2. L2 Regularization or Ridge regression

Let’s first begin with understanding L2 regularization or ridge regression.

L2 Regularization or Ridge regression

The cost function for ridge regression is given by:

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

Here lambda (λ) is a hyperparameter that determines how severe the penalty is. The value of lambda can vary from 0 to infinity. One can observe that when the value of lambda is zero, the penalty term no longer impacts the value of the cost function, which reduces back to the sum of squared errors.
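As a sketch of how the penalty enters the cost (the function and values below are purely illustrative, not part of any library):

```python
import numpy as np

def ridge_cost(y_actual, y_pred, coeffs, lam):
    """Sum of squared errors plus the L2 penalty on the coefficients."""
    sse = np.sum((np.asarray(y_actual) - np.asarray(y_pred)) ** 2)
    penalty = lam * np.sum(np.asarray(coeffs) ** 2)
    return sse + penalty

y = [1.0, 2.0, 3.0]
y_hat = [1.1, 1.9, 3.2]

# With lam = 0 the penalty vanishes and the cost is just the sum of squared errors
print(ridge_cost(y, y_hat, coeffs=[0.5, -2.0], lam=0.0))  # ~0.06
print(ridge_cost(y, y_hat, coeffs=[0.5, -2.0], lam=1.0))  # ~0.06 + 4.25
```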

To appreciate the significance of the penalty term, let's delve into an example.

Suppose we evaluate the performance of our model solely on the basis of the "sum of squared errors"; we then get the curve shown in the graph on the left in the image below.

[Figure: left, the overfitted curve obtained with lambda = 0; right, the simpler curve obtained with a larger lambda]

As mentioned previously, when lambda is zero the penalty term does not affect the value of the cost function, so we get the same overfitted curve. However, when the value of lambda is increased, we get the simpler curve shown in the graph on the right in the image above.

Comparing the mean squared errors for the two models, we observe that the error on the training set is least for the overfitting curve but there is a significant drop in error observed on the test set for the simpler curve.

So by making our curve simpler, we introduce some error on the training set, but this enables us to move towards a more generalized model.

One important aspect that needs to be emphasized is that by altering the value of lambda and converting our complex curve into a simple one, we are still dealing with the same 15-degree polynomial model. The terms up to degree 15 still exist in the simpler model's equation, and yet the model has become less complex.

How was this achieved?

The answer lies in the mechanism of the penalty itself. Let's take a look at the cost function again:

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

Lambda is a hyperparameter determining the severity of the penalty. As the penalty increases, the coefficients shrink in value in order to minimize the cost function. Since these coefficients also act as weights for the polynomial terms, shrinking them reduces the weight assigned to those terms and ultimately their impact. Therefore, in the case above, the coefficients of the higher-degree polynomial terms have shrunk to the point where those terms no longer affect the model as severely as before, and so we obtain a simple curve.
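A minimal sketch of this shrinkage using scikit-learn's Ridge estimator, whose alpha parameter plays the role of lambda; the degree-15 expansion and the synthetic data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)

# Expand the single input into a degree-15 polynomial, then fit with growing penalties
X_poly = PolynomialFeatures(degree=15, include_bias=False).fit_transform(x)
for alpha in (0.01, 1.0, 100.0, 10000.0):
    model = Ridge(alpha=alpha).fit(X_poly, y)
    print(f"alpha={alpha:>8}: size of coefficient vector = {np.linalg.norm(model.coef_):.2f}")
```

The overall size of the coefficient vector shrinks as the penalty grows, which is exactly the mechanism described above.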

After identifying the optimal value of lambda, we apply it to the model and get the below curve.

[Figure: the curve fitted with the optimal value of lambda]

Effect of varying the values of lambda

[Figure: fitted curves for increasing values of lambda]

We observe that as the value of lambda increases, the model grows simpler until the curve is asymptotically parallel to the x-axis. In other words, for a very high value of lambda, we have a highly biased model.

How to choose the value of lambda?

This brings us to a quandary. For a very low value of lambda, an overfitting curve is obtained and for a very high value of lambda, an underfitting or highly biased model is obtained. How then can an optimal value of lambda be achieved?

The answer is cross-validation. Typically, 10-fold cross-validation can help us identify the optimal value of lambda.
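A minimal sketch using scikit-learn's RidgeCV, which evaluates a grid of candidate penalties by cross-validation (the data and the grid below are invented):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(0, 0.5, 100)

# Evaluate a grid of candidate penalties with 10-fold cross-validation
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, cv=10).fit(X, y)
print("selected alpha:", model.alpha_)
```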

Multidimensional data set

Suppose we are trying to predict the physical size of an animal, and that this size depends on its weight and age. In this case, our model's function can be represented as:

$$\text{Size} = \beta_1 \times \text{Weight} + \beta_2 \times \text{Age}$$

As we have multiple features on which the size depends, the cost can be given by:

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\left(\beta_1^2 + \beta_2^2\right)$$

If we have a limited number of data points in our training set, ridge regression can improve predictions by reducing the variance, making the model less sensitive to the particular training data and therefore less sensitive to noise.
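A small sketch of this two-feature setting with scikit-learn's Ridge (the weight, age and size numbers below are invented):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Invented measurements: columns are weight (kg) and age (years)
X = np.array([[2.0, 0.5],
              [3.5, 1.0],
              [5.0, 2.0],
              [6.0, 3.0],
              [8.0, 4.0]])
size = np.array([20.0, 28.0, 35.0, 40.0, 50.0])  # e.g. body length in cm

# With only a handful of training points, the penalty keeps the coefficients modest
model = Ridge(alpha=1.0).fit(X, size)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```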

L1 Regularization or LASSO regression

LASSO stands for Least Absolute Shrinkage and Selection Operator.

The cost function for lasso regression is given by:

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right|$$

As in ridge regression, lambda is a hyperparameter that determines how severe the penalty is. The difference between the two cost functions is that ridge regression penalizes the square of the coefficients (slopes), whereas lasso regression penalizes their absolute value.

Effect of changing lambda

As the value of lambda increases, the coefficients get closer and closer to 0 until they ultimately reach 0.

Note that in ridge regression, the coefficients shrank as lambda increased until the model was asymptotically parallel to the x-axis. In lasso regression, for a large value of lambda, the model actually becomes parallel to the x-axis.
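A minimal sketch of this contrast with scikit-learn on a one-feature dataset (alpha again plays the role of lambda, and the data is synthetic): the ridge slope keeps shrinking but stays non-zero, while for a large enough alpha the lasso slope becomes exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
x = rng.normal(size=(50, 1))
y = 2.0 * x.ravel() + rng.normal(0, 1.0, 50)

for alpha in (0.1, 1.0, 10.0):
    ridge_slope = Ridge(alpha=alpha).fit(x, y).coef_[0]
    lasso_slope = Lasso(alpha=alpha).fit(x, y).coef_[0]
    print(f"alpha={alpha:>5}: ridge slope {ridge_slope:.4f}, lasso slope {lasso_slope:.4f}")
```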

To appreciate this, let’s take an example.

Suppose the size of an animal is given by the equation:

$$\text{Size} = \text{coeff1} \times \text{Weight} + \text{coeff2} \times \text{Food calorie content} + \text{coeff3} \times \text{Zodiac sign} + \text{coeff4} \times \text{Wind speed}$$

Lasso regression works in such a way that the most relevant features are retained while the others are shrunk towards zero.

The cost function for the above equation can be given as:

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\left(\left|\text{coeff1}\right| + \left|\text{coeff2}\right| + \left|\text{coeff3}\right| + \left|\text{coeff4}\right|\right)$$

In the above example “Weight” and “Food calorie content” are most relevant for size while “Zodiac sign” and “Wind speed” are least relevant. Therefore, “coeff1” and “coeff2” will shrink a little bit while “coeff3” and “coeff4” will be shrunk all the way to zero.

Thus, we'll be left with the equation:

$$\text{Size} = \text{coeff1} \times \text{Weight} + \text{coeff2} \times \text{Food calorie content}$$

Since lasso regression can exclude insignificant variables from the equation, it is a little better than ridge regression at reducing variance in models containing a lot of useless variables. In other words, lasso regression can help in feature selection.

In contrast, ridge regression tends to do a little better when most variables are useful.
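A minimal sketch of this difference with scikit-learn, using synthetic data in the spirit of the example above, with two useful features and two useless ones; the feature names and alpha values are invented:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
n = 200
weight = rng.normal(size=n)
calories = rng.normal(size=n)
zodiac = rng.normal(size=n)    # irrelevant to size
wind = rng.normal(size=n)      # irrelevant to size
X = np.column_stack([weight, calories, zodiac, wind])
size = 3.0 * weight + 1.5 * calories + rng.normal(0, 0.5, n)

# Lasso typically drives the useless coefficients exactly to zero; ridge only shrinks them
print("lasso coefficients:", np.round(Lasso(alpha=0.1).fit(X, size).coef_, 3))
print("ridge coefficients:", np.round(Ridge(alpha=1.0).fit(X, size).coef_, 3))
```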

Another interesting difference between LASSO and Ridge regression

It was mentioned earlier that as the value of lambda continues to increase in ridge regression, the model curve grows flatter until it is asymptotically parallel to the x-axis.

To understand this, we shall refer to a linear regression model for simplicity.

[Figure: a linear regression line fitted through the data points]

The dots represent the data points and the line represents the regression model. Since this is a straight line, it has a slope and a y-intercept.

As the value of lambda increases the slope of the linear regression model will continuously decrease.

At a large value of lambda, we observe the below curve for the regression model.

[Figure: the nearly flat regression line obtained with a large value of lambda]

Although it appears that the line is parallel to the x-axis, in reality the slope of the linear model is still slightly greater than zero.

This behaviour can be observed visually in the graphs below. The x-axis represents the slope of the model and the y-axis represents the value of the model's cost function. To the left side of each graph is a range of values of lambda.

[Figure: ridge regression cost plotted against slope for several values of lambda]

The blue dots on each curve represent the lowest value of the model's cost function. As the value of lambda increases, the minimum of the cost function for a ridge regression model moves closer and closer to a slope of zero, but it never coincides with zero.

[Figure: lasso regression cost plotted against slope for several values of lambda]

In contrast, in a lasso regression model we observe a similar trend as lambda increases: the lowest value of the cost function gradually moves towards a slope of zero. However, in this case, for a large value of lambda the lowest value of the cost function is achieved when the slope is exactly zero. In particular, for lambda equal to 40, 60 and 80 we observe a noticeable kink in the graph at a slope of zero, and for lambda equal to 80 this kink is also the lowest value of the cost function.

Hence, for a large value of lambda, lasso regression models can have a slope equal to zero.
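This can also be checked numerically: a sketch that evaluates each cost over a grid of candidate slopes for a synthetic one-feature dataset and locates the minimum (the lambda values are arbitrary). For a large penalty the lasso minimum sits exactly at a slope of zero while the ridge minimum stays slightly above zero.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(0, 1.0, 30)

# Evaluate the sum of squared errors over a grid of candidate slopes
slopes = np.linspace(-1.0, 3.0, 4001)
sse = ((y[None, :] - slopes[:, None] * x[None, :]) ** 2).sum(axis=1)

for lam in (0, 50, 200):
    ridge_best = slopes[np.argmin(sse + lam * slopes ** 2)]
    lasso_best = slopes[np.argmin(sse + lam * np.abs(slopes))]
    print(f"lambda={lam:>3}: ridge minimum at slope {ridge_best:.3f}, "
          f"lasso minimum at slope {lasso_best:.3f}")
```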

How does Lasso Regression help in feature selection?

Let's take a look at the cost function for each type of regularization technique.

For Lasso Regression:

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right|$$

For Ridge Regression:

$$\text{Cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

One way lasso regression can be interpreted is as solving an equation where the sum of the moduli of the coefficients is less than or equal to a constant "c". Similarly, ridge regression can be interpreted as solving an equation where the sum of the squares of the coefficients is less than or equal to a constant "c".

Suppose our model incorporates two features to predict a certain quantity, and that the coefficients of these features are given by β1 and β2.

In such an instance, ridge regression can be expressed by:

$$\beta_1^2 + \beta_2^2 \leq c$$

Similarly, lasso regression can be expressed by:

$$\left|\beta_1\right| + \left|\beta_2\right| \leq c$$

The equation for ridge regression resembles the equation of a circle, so the constraint region lies within and on the circumference of a circle. Similarly, the equation for lasso regression resembles that of a diamond, with the constraint region lying inside and on the periphery of that shape.

The equations have been visualized in the image below.

The constraint regions are represented by the light blue areas, while the red ellipses are the contours of the sum of squared errors. The value of the sum of squared errors is the same everywhere on a given contour; the further a contour is from the center, the higher that value.

[Figure: the lasso (diamond) and ridge (circle) constraint regions, together with the contours of the sum of squared errors]

Source: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

If the value of "c" is sufficiently large, the constraint region will contain β̂, the center of the contours, and so the ridge and lasso estimates will simply equal the least-squares estimates, i.e. the values that minimize the sum of squared errors. This corresponds to the case where lambda = 0.

The coefficient estimates for ridge and lasso regression are given by the first point at which a contour touches the constraint region. Since ridge regression has a circular constraint region, this point of contact will generally not lie on an axis, so the coefficient estimates will mostly be non-zero.

However, since the lasso constraint region has corners jutting out, there is a greater chance that the contours will touch the constraint region on an axis. In such cases, the value of one of the coefficients will be exactly zero. In higher-dimensional spaces, where we have more than two features, many coefficient estimates may be equal to zero simultaneously.
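A quick sketch of this in a higher-dimensional setting with scikit-learn (synthetic data, arbitrary alpha): only 3 of 20 features actually matter, and lasso typically sets most of the remaining coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
true_coefs = np.zeros(20)
true_coefs[:3] = [2.0, -1.5, 1.0]          # only the first 3 features matter
y = X @ true_coefs + rng.normal(0, 0.5, 100)

coefs = Lasso(alpha=0.1).fit(X, y).coef_
print("coefficients exactly equal to zero:", int(np.sum(coefs == 0)), "out of 20")
```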

Conclusion

Regularization is an effective technique to prevent a model from overfitting. It allows us to reduce the variance of a model without a substantial increase in its bias. This lets us develop a more generalized model even when only a few data points are available in our dataset.

Ridge regression helps to shrink the coefficients of a model whose determining parameters or features are already known to be relevant.

In contrast, lasso regression can be effective to exclude insignificant variables from the model’s equation. In other words, lasso regression can help in feature selection.

Overall, it’s an important technique that can substantially improve the performance of our model.

