Assumptions in Linear Regression you might not know.

Category: IT Technology · Published: 3 years ago

DS INTO THE REAL WORLD

The model should conform to these assumptions to produce the best Linear Regression fit to the data.

Jul 16 · 6 min read

Photo by Joseph Barrientos on Unsplash

— All the images (plots) are generated and modified by Author.

Introduction

Linear Regression is a method of modelling the best linear relationship between the independent variables and the dependent variable. Its simplest form, with one independent and one dependent variable, is defined by the following equation:

$$y = \beta_0 + \beta_1 x$$
Simple Linear Regression

x is the independent variable,

y is the dependent variable,

β1 is the coefficient of x, i.e. the slope,

β0 is the intercept (constant), i.e. the value of y at which the line crosses the y-axis (x = 0).
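The equation above can be fitted by ordinary least squares. The following is a minimal sketch, assuming NumPy is available; the function name `fit_simple_linear_regression` and the sample data are illustrative and not taken from the article.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) that minimise the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Illustrative (made-up) data lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
beta0, beta1 = fit_simple_linear_regression(x, y)
print(beta0, beta1)  # intercept and slope recovered from the data
```

Because the sample points lie exactly on a line, the fit recovers the intercept 1 and the slope 2; with noisy data the estimates would only approximate the true values.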

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

Wikipedia

Linear Regression Types

1. Simple Linear Regression: The simplest form of regression, which involves one independent variable and one dependent variable, as explained above, where we fit a line to the data.

2. Multiple Linear Regression: The more complex form of regression, which involves multiple independent variables and one dependent variable, and can be expressed by the following equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
Multiple Linear Regression

x1 to xn are the independent variables,

y is the dependent variable,

β1 to βn are the coefficients of the respective x features, and

β0 is the intercept (constant), i.e. the value of y when all x features are zero.
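With several features, the coefficients can be estimated in one step by solving a least-squares problem. The sketch below assumes NumPy and uses `np.linalg.lstsq`; the helper name `fit_multiple_linear_regression` and the data are hypothetical.

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Return [beta0, beta1, ..., betan] for X of shape (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Illustrative (made-up) data generated from y = 0.5 + 1.5*x1 - 2.0*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 0.5 + 1.5 * X[:, 0] - 2.0 * X[:, 1]
coef = fit_multiple_linear_regression(X, y)
print(coef)  # intercept first, then one coefficient per feature
```

The column of ones is a design choice that lets the same solver estimate the intercept together with the slopes.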

Assumptions in Linear Regression

Photo by Tom Roberts on Unsplash

1. Linear Relationship: It is assumed that the relationship between the independent variables and the dependent variable is linear in the coefficients, i.e. the dependent variable is modelled as a weighted sum of the predictor terms, and the weights (coefficients) are what the model estimates during fitting.

Image by Author

The predictor variables are treated as fixed values and may themselves be any function of the raw inputs, such as polynomial or trigonometric terms. The model must, however, remain strictly linear in the coefficients.

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$$
Polynomial Regression

This assumption is what makes Polynomial Regression possible: it uses linear regression to fit the response variable as an arbitrary polynomial function of a predictor variable, and the model is still linear in the coefficients.
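To illustrate that a polynomial fit is still a linear regression, the sketch below builds the powers of x as separate columns and solves an ordinary least-squares problem. It assumes NumPy; the name `fit_polynomial` and the data are illustrative only.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Fit y as a polynomial in x using ordinary linear least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Columns are x**0, x**1, ..., x**degree: the model is non-linear in x
    # but linear in the coefficients that multiply each column.
    design = np.vander(x, N=degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Illustrative (made-up) data from y = 1 - 3x + 0.5x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 3.0 * x + 0.5 * x ** 2
coef = fit_polynomial(x, y, degree=2)
print(coef)  # coefficients ordered from the constant term upward
```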

2. Homoscedasticity (Constant Variance): It is assumed that the residual terms (that is, the “noise” or random disturbance in the relationship between the features and the target) have constant variance, i.e. the spread of the error term is the same regardless of the values of the predictor variables.

Image by Author — Modified

A plot of the residuals against the predicted values should show no clear pattern; if there is a specific pattern, the data is heteroscedastic. The leftmost graph shows no definite pattern among the error terms, i.e. their spread stays constant. The middle graph shows a pattern where the error decreases and then increases with the estimated values, violating the constant variance rule. The rightmost graph also reveals a specific pattern, where the error terms decrease with the predicted values, again indicating heteroscedasticity. More generally, two or more normal distributions are homoscedastic if they share a common covariance (or correlation) matrix.
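Besides inspecting the plot, constant variance can be checked numerically. The article does not name a test; the sketch below swaps in a Breusch-Pagan-style statistic (the studentized variant, n times the R² from regressing the squared residuals on the predictor) as one possible check. It assumes NumPy, and the function names and simulated data are hypothetical.

```python
import numpy as np

def ols_coefficients(design, y):
    """Least-squares coefficients for a design matrix that includes an intercept column."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def breusch_pagan_lm(x, y):
    """n * R^2 from regressing squared OLS residuals on x (single predictor)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    resid = y - design @ ols_coefficients(design, y)
    squared = resid ** 2
    # Auxiliary regression: can x explain the size of the squared residuals?
    fitted = design @ ols_coefficients(design, squared)
    r2 = 1.0 - np.sum((squared - fitted) ** 2) / np.sum((squared - squared.mean()) ** 2)
    return len(x) * r2

# Simulated (made-up) data: same line, two different noise structures.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=500)
y_homo = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=500)      # constant noise
y_het = 2.0 + 3.0 * x + x * rng.normal(0.0, 1.0, size=500)   # noise grows with x
lm_homo = breusch_pagan_lm(x, y_homo)
lm_het = breusch_pagan_lm(x, y_het)
print(lm_homo, lm_het)
```

Under constant variance and with one predictor, this statistic approximately follows a chi-squared distribution with 1 degree of freedom, so values well above about 3.84 (the 5% critical value) suggest heteroscedasticity.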

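3. Multivariate Normality (Normality of Errors): It is assumed that the error terms are normally distributed with a mean of zero. (When the model includes an intercept, the least-squares residuals also sum to zero by construction.) A less widely known fact is that, as the sample size grows, the normality assumption matters less and less: by the Central Limit Theorem, the coefficient estimates are approximately normally distributed in large samples even when the errors are not.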

The above q-q plot shows that the errors or residuals are normally distributed. The error term can be seen as the composite of many minor, independent errors. As the number of these minor errors increases, the distribution of the error term tends to approach the normal distribution; this tendency is described by the Central Limit Theorem. The t-test and F-test are exact only if the error term is normally distributed, and hold approximately in large samples otherwise.
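A q-q plot is a visual check; as a numeric complement, the sketch below computes the Jarque-Bera statistic, which is based on the skewness and excess kurtosis of the residuals. This is a swapped-in technique, not the one used in the article. It assumes NumPy, and the function name and simulated samples are illustrative.

```python
import numpy as np

def jarque_bera_statistic(residuals):
    """Jarque-Bera statistic: n/6 * (skewness^2 + excess_kurtosis^2 / 4)."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    centered = r - r.mean()
    m2 = np.mean(centered ** 2)                      # second central moment
    skewness = np.mean(centered ** 3) / m2 ** 1.5
    excess_kurtosis = np.mean(centered ** 4) / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

# Simulated (made-up) residuals: one normal sample, one strongly skewed sample.
rng = np.random.default_rng(1)
jb_normal = jarque_bera_statistic(rng.normal(size=1000))
jb_skewed = jarque_bera_statistic(rng.exponential(size=1000))
print(jb_normal, jb_skewed)
```

For normally distributed residuals the statistic stays small (it is approximately chi-squared with 2 degrees of freedom), while skewed or heavy-tailed residuals push it far higher.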

4. No Multicollinearity: Multicollinearity is the degree of inter-correlation among the independent variables used in the model. It is assumed that the independent feature variables are uncorrelated, or only weakly correlated, with each other. As a practical rule of thumb, the absolute correlation between two independent features should not exceed about 0.3 (30%), since stronger correlation inflates the variance of the coefficient estimates and weakens the statistical power of the model. To identify highly correlated features, pair plots (scatter plots) and heatmaps (correlation matrices) can be used.

Correlation Heatmap — Image by Author

Highly correlated features should not be used together in the model because they tend to change in unison. Interpreting a regression coefficient requires changing one feature while holding the others constant, but when two features are correlated, a change in one comes with a change in the other. The expected interpretation of the regression coefficients then no longer holds.
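The correlation matrix behind such a heatmap can be computed directly. The sketch below also computes the variance inflation factor (VIF), a common diagnostic that the article does not mention and that is added here as an assumption. It relies on NumPy only; the function name and simulated features are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Simulated (made-up) features: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + 0.1 * rng.normal(size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)   # 3x3 matrix of pairwise correlations
vifs = variance_inflation_factors(X)
print(np.round(corr, 2))
print(np.round(vifs, 1))
```

In this simulated data the pair (x1, x2) far exceeds the 0.3 rule of thumb and both get a large VIF, while x3 stays close to a VIF of 1.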

5. No Auto-correlation: It is assumed that there is no auto-correlation among the residuals. Auto-correlation occurs when there is a dependency between residual errors; the residuals should not be correlated with each other, positively or negatively, and should be spread evenly. It usually arises in time series models, where the next instant depends on the previous instant. Correlation in the residual terms also reduces the model’s predictability.

Autocorrelation can be tested with the help of the Durbin-Watson test. The Durbin-Watson test statistic is defined as:

$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$
Durbin-Watson Equation (e_t is the residual at observation t, T is the number of observations)

The Durbin-Watson test statistic always has a value between 0 and 4. A value of 2.0 indicates that no autocorrelation is detected in the sample. Values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.
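The statistic is simple enough to compute by hand. The following is a minimal sketch assuming NumPy; the function name `durbin_watson` is chosen here for illustration and the residual sequences are made up to show the two extremes.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Residuals that flip sign every step: strong negative autocorrelation.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Residuals that never change: strong positive autocorrelation.
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
print(dw_alternating)  # 3.0
print(dw_constant)     # 0.0
```

The alternating sequence lands between 2 and 4 and the constant sequence lands at 0, matching the interpretation given above.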

6. No Extrapolation: Extrapolation is estimation beyond the original observation range. It is assumed that the trained model predicts the dependent variable only for independent feature values that lie within the range of the training data. The model cannot guarantee the predicted values for inputs beyond the range of the feature values it was trained on.

Image by Author — Modified
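One way to guard against accidental extrapolation is to flag inputs that fall outside the range seen during training before predicting. The sketch below assumes NumPy and checks each feature separately; the function name `within_training_range` and the data are hypothetical.

```python
import numpy as np

def within_training_range(X_train, X_new):
    """Return one boolean per row of X_new: True if every feature lies inside
    the [min, max] range of that feature in X_train.
    Both inputs must have shape (n_samples, n_features)."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    lower = X_train.min(axis=0)
    upper = X_train.max(axis=0)
    return np.all((X_new >= lower) & (X_new <= upper), axis=1)

# Illustrative (made-up) training data and new points.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 15.0],    # inside both ranges
                  [4.0, 15.0],    # first feature above the training maximum
                  [2.5, 35.0]])   # second feature above the training maximum
mask = within_training_range(X_train, X_new)
print(mask)  # [ True False False]
```

This per-feature check is deliberately simple: a point can pass it and still lie far from any training example when the features are correlated, so it should be treated as a first filter only.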

Conclusion:

We have explained the most important assumptions to check before applying a Linear Regression model to a given set of data. These assumptions are a formal measure to ensure that the predictability of the built linear regression model is good enough to give the best possible results for a given data set. A Linear Regression model can still be built if they are not satisfied, but satisfying them provides good confidence in the predictability of the model.

