Diving Deeper into Linear Regression


Photo by Artem Verbo on Unsplash

When I say “linear regression”, most people start thinking about the good old Ordinary Least Squares (OLS) regression. If you are not familiar with the term, these equations might help…

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \qquad J(\beta) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

β_1, β_2: weights; β_0: bias; J(β): cost function

Did you also think about OLS? If yes, then you are on the right track. But there’s more to linear regression than just OLS! First, let us look at OLS a bit more closely.

OLS

The name of this technique comes from its cost function. Here, we take the sum of squared errors (the differences between the ground truth and the predictions) and try to minimize it. By minimizing the cost function we obtain the optimal value of the vector β (which contains the bias and the weights). In the plot below, the contours (concentric ellipses) of the cost function are shown. After the minimization, we get β as the point at the center.

[Figure: OLS — contours (concentric ellipses) of the cost function, with the optimal β at the center]
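As a concrete (hypothetical) example of my own, here is a minimal NumPy sketch that fits the two-feature model above by minimizing the sum of squared errors; the data and coefficients are made up for illustration:

```python
import numpy as np

# Hypothetical toy data: 100 samples, two features x1 and x2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that beta_0 (the bias) is estimated along with the weights
X_b = np.c_[np.ones(len(X)), X]

# Minimize J(beta) = sum of squared errors; lstsq solves the least-squares problem stably
beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(beta)  # roughly [3.0, 1.5, -2.0] = [beta_0, beta_1, beta_2]
```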

At first, it seems like OLS is enough for any regression problem. But as we increase the number of features and the complexity of the data, OLS tends to overfit the training data. The concept of overfitting is vast and deserves a separate article (you can find plenty of them), so I’m only going to give you a brief overview. Overfitting means the model has learned the training data so well that it fails to generalize. In other words, the model has learned even the small-scale (insignificant) variations in the training data, so it fails to produce good predictions on unseen (validation and test) data. To tackle overfitting we can use many techniques. Adding a regularization (penalty) term to our cost function is one such technique. But what term should we use? We generally use one of the following two methods.

Ridge

In this case, we add the sum of squares of the weights to our least-squares cost function. So now it looks something like this…

$$J(\beta) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m-1}\beta_j^2$$

m: 1 + number of features, so β has m components (β_0, …, β_{m−1}); λ: regularization parameter

But how does this term prevent overfitting? Adding it is equivalent to adding an extra constraint on the possible values of β: to achieve the minimum cost, the sum of the β_j²’s must not exceed a certain value (say r). This prevents the model from assigning very large weights to some features over others, thus tackling overfitting. Mathematically,

$$\sum_{j=1}^{m-1}\beta_j^2 \le r$$

In other words, β should lie inside (or on) the circle with radius √r centered at the origin. Here’s the visualization…

[Figure: Ridge — cost-function contours together with the circular constraint region of radius √r]

Notice that, because of the constraint (red circle), the final value of β is closer to the origin than it was in OLS.
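To see the shrinkage in practice, here is a small sketch of my own (not from the article) using scikit-learn’s Ridge estimator, whose alpha argument plays the role of λ and which leaves the intercept unpenalized; the nearly duplicated feature is a made-up example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data with two nearly identical (correlated) features,
# which can make the plain OLS weights unstable
rng = np.random.default_rng(0)
x1 = rng.normal(size=60)
X = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=60), rng.normal(size=60)])
y = 2.0 * x1 + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=60)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)        # alpha plays the role of lambda

print(ols.coef_, ridge.coef_)
# The L2 norm of the Ridge weights never exceeds that of the OLS weights
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```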

Lasso

The only difference between Ridge and Lasso is the regularization term. Here, we add the sum of the absolute values of the weights to our least-squares cost function. So the cost function becomes…

$$J(\beta) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m-1}\left|\beta_j\right|$$

In this case, the constraint can be written as…

$$\sum_{j=1}^{m-1}\left|\beta_j\right| \le r$$

Now we can visualize the constraint as a square instead of a circle.

[Figure: Lasso — cost-function contours together with the square constraint region]

It is worth noting that if the contours hit a corner of the square, one feature is completely neglected (its weight becomes 0). In higher-dimensional feature spaces, we can use this trick to reduce the number of features.
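As a quick illustration (my own sketch, not part of the article), scikit-learn’s Lasso, whose alpha argument plays the role of λ, drives the weights of uninformative features exactly to zero on this made-up dataset:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 10 features, but only the first two actually influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)        # alpha plays the role of lambda

print(lasso.coef_)                        # most weights are exactly 0.0
print(np.flatnonzero(lasso.coef_))        # indices of the features Lasso kept
```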

Note: In the regularization term we do not include the bias (β_0), because only very large weights (β_i’s for i > 0) corresponding to the features contribute to overfitting. The bias term is just an intercept and hence has little to do with overfitting.

Phew… that was a lot about regularization. The common thing among the above methods is that they all use residuals/errors (ground truth minus prediction) in their cost function, and these errors are measured parallel to the y-axis. We could also consider errors along the x-axis and proceed similarly. See the plot below.

[Figure: y-errors (vertical distances) and x-errors (horizontal distances) from the data points to the fitted line]

What if we use a different kind of error?

Major axis (Orthogonal) regression

In this case, we consider errors in both directions (along the x-axis and the y-axis). We minimize the sum of squared perpendicular distances between the observed data points and the fitted line. Let’s visualize this by taking only one feature.

[Figure: each data point projected perpendicularly onto the best-fit line]

(X_i, Y_i): the foot of the perpendicular drawn from (x_i, y_i) onto the best-fit line

Let our model be

$$Y = \beta_0 + \beta_1 X$$

Then the regression coefficients can be obtained by minimizing

$$\sum_{i=1}^{n}\left[\left(x_i - X_i\right)^2 + \left(y_i - Y_i\right)^2\right]$$

under the constraints

$$Y_i = \beta_0 + \beta_1 X_i, \quad i = 1, \dots, n$$
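Here is a minimal sketch of my own for the single-feature case: minimizing the perpendicular distances amounts to taking the leading principal axis of the sample covariance of (x, y), with the line passing through the centroid (SciPy’s scipy.odr module provides a more general implementation). The data below is made up for illustration.

```python
import numpy as np

# Hypothetical data with noise in both x and y
rng = np.random.default_rng(1)
t = rng.normal(size=200)
x = t + rng.normal(scale=0.2, size=200)
y = 1.0 + 2.0 * t + rng.normal(scale=0.2, size=200)

# The best-fit line passes through the centroid along the leading principal axis
cov = np.cov(x, y)
eigvals, eigvecs = np.linalg.eigh(cov)
direction = eigvecs[:, np.argmax(eigvals)]   # eigenvector of the largest eigenvalue

beta_1 = direction[1] / direction[0]         # slope of the orthogonal fit
beta_0 = y.mean() - beta_1 * x.mean()        # intercept: line goes through the centroid
print(beta_0, beta_1)                        # close to (1.0, 2.0)
```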

Reduced Major axis regression

This is very similar to the above method, with a slight change. Here, we minimize the sum of the areas of the rectangles formed by (X_i, Y_i) and (x_i, y_i).

[Figure: Reduced major axis — the rectangle formed by a data point (x_i, y_i) and the corresponding point (X_i, Y_i) on the fitted line]

The total area over the n data points is

$$S = \sum_{i=1}^{n}\left|x_i - X_i\right|\,\left|y_i - Y_i\right|$$

The constraints here are the same as in orthogonal regression.
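For a single feature, the reduced major axis fit has a well-known closed form: the slope magnitude is the ratio of standard deviations s_y / s_x, its sign comes from the correlation, and the line again passes through the centroid. A small sketch of my own, reusing data of the same made-up form as above:

```python
import numpy as np

def rma_fit(x, y):
    """Reduced major axis regression for a single feature."""
    slope = np.sign(np.corrcoef(x, y)[0, 1]) * np.std(y, ddof=1) / np.std(x, ddof=1)
    intercept = np.mean(y) - slope * np.mean(x)   # the line passes through the centroid
    return intercept, slope

# Hypothetical data with noise in both variables
rng = np.random.default_rng(1)
t = rng.normal(size=200)
x = t + rng.normal(scale=0.2, size=200)
y = 1.0 + 2.0 * t + rng.normal(scale=0.2, size=200)

print(rma_fit(x, y))   # roughly (1.0, 2.0)
```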

When should you use orthogonal regression?

One should go for orthogonal or reduced major axis regression when uncertainties are present in both the study (y) and explanatory (x) variables.

One interesting property of orthogonal regression is that it produces a fit that is symmetric with respect to y-errors and x-errors. In OLS we do not get this symmetry, because we minimize either the y-errors or the x-errors, never both.

Still curious? Watch a video that I made recently…

I hope you enjoyed the reading. Until next time…Happy learning!

