Build, Evaluate, and Interpret Your Own Linear Regression Model in Minutes

Image by Martin Winkler from Pixabay

Introduction to Linear Regression

Regression is central to so much of the statistical analysis & machine learning tools that we leverage as data scientists.

Stated simply, we utilize regression techniques to model Y through some function of X. Deriving that function of X often heavily depends on linear regression and is the basis of our explanation or prediction.

Let’s dive right in by taking a look at modeling some numeric variable Y by some numeric explanatory variable, X.

Regression with a Numeric Explanatory Variable

I have pulled down a house prices dataset from Kaggle. You can find that here: https://www.kaggle.com/shree1992/housedata/data

Below you’ll see a scatter plot between the sqft living space of a home and its price.

In addition to that scatter plot, I also include a regression line. More on that in a moment.

library(tidyverse)  # provides %>% and ggplot2

# housing is the Kaggle house-price dataset loaded as a data frame
housing %>%
    ggplot(aes(x = sqft_living, y = price)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)  # overlay an OLS regression line, no confidence band

[Figure: scatter plot of sqft_living vs. price with the fitted regression line]

What you can see above is that these two variables are indeed correlated, and you'll also notice that the trend line moves right through the middle of the data points.

Let's talk about how regression works, and in this case, how ordinary least squares (OLS) regression works.

What we're seeing here is a line, a line that has a y-intercept and a slope. When it comes to slope, you can also think rise over run!

Now I want to highlight that there is an objective function that determines the placement of said line.

The line is placed where the sum of squared vertical distances between the line and the surrounding data points is smallest. In other words, if you placed that y-intercept a little higher, or increased the slope of the line… the total squared error between the actuals and the predictions would go up. Hence the rationale for positioning the line right down the middle of the group… where that error is smallest.
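To make that objective concrete, here is a minimal sketch on made-up toy data (the x and y vectors below are purely illustrative, not drawn from the housing dataset) that computes the OLS slope and intercept from their closed-form formulas and confirms they match what lm gives back:

# Toy data, purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form OLS estimates: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The sum of squared residuals: the quantity OLS minimizes
sse <- sum((y - (intercept + slope * x))^2)

# lm() should agree with the hand-computed coefficients
coef(lm(y ~ x))
c(intercept = intercept, slope = slope)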

Correlation versus Causation

Now, we’ve observed a relationship between two variables that are positively correlated.

With that said, can we conclude that x causes y? Certainly not! If you remembered that from college statistics, give yourself a pat on the back. Obviously there could be any number of other factors at play.

To call on the notion of the general modeling framework, when we build a linear model, we are creating a linear function, or a line.

The purpose of this line is to allow us to either explain or predict.

Whatever the case, modeling a line requires a y-intercept and a slope.

In another post, I speak about the general modeling framework: Y as some function of X plus epsilon, or error. In the case of the equation of a line, you may ask yourself where epsilon is… and the answer is that we don't represent epsilon in our equation of a line or linear function, as the sole purpose of the model is to capture signal, not noise.
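Written out in the standard textbook notation (nothing specific to this dataset), the distinction looks like this:

Y = beta_0 + beta_1 * X + epsilon    (the model: signal plus noise)
y_hat = b_0 + b_1 * x                (the fitted line: signal only, no error term)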

Interpreting Your Regression Model Output

We’ll first run the lm function in R. This function builds a simple linear model as determined by the formula you pass it.

y ~ x, or in this case, price as a function of sqft living.

fit <- lm(price ~ sqft_living, data = housing)  # simple linear model: price as a function of sqft_living

[Figure: printed model output showing the call and the fitted coefficients]

In the above output you can see our call, but also the coefficients section.

This section highlights our equation of a line. The y-intercept is 12954 and our coefficient for our explanatory variable, sqft_living, is 252. The way to interpret that coefficient is that for every 1 unit increase in sqft_living, we should see a 252 unit increase in price.
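If you'd rather pull those numbers out of the model object than read them off the printout, coef() returns them as a named vector (a quick sketch, assuming the fit object created above):

coef(fit)                   # named vector: (Intercept) and sqft_living
coef(fit)["sqft_living"]    # just the slope on sqft_living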

My house is about 3000 sqft, so according to this equation of a line, if you plopped my house down in Seattle, we'd predict its value to be $12,954 + $252*3000 = $768K… needless to say, all of this data is based on the Seattle housing market… my home is not nearly that valuable.
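The same back-of-the-envelope math can be handed off to predict() so you don't have to multiply by hand (a small sketch; the 3000 sqft input is just my example house, not a row from the dataset):

# Predicted price for a hypothetical 3,000 sqft home
predict(fit, newdata = data.frame(sqft_living = 3000))
# equivalent to 12954 + 252 * 3000 = 768,954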

With this example behind us, one thing to keep in mind is that it's the slope, or coefficient, that we rely on to quantify the relationship between X and Y.

Diving Deeper into Your Regression Output

We are going to dive deeper into the nitty-gritty of your linear model. We'll do so a couple of different ways, but the first will be with the classic summary function in R.

summary(fit)

With a call as simple as that we get the following regression output.

[Figure: summary(fit) regression output]

Let’s go from the top!

First things first: the call makes sense. We get some stats on the residuals, or in other words the error, but we won't dive deep into that for now.

Next we see the coefficients as we saw before in a slightly different format.

A couple of things I want to point you to are the ideas of R-squared and p-value… two of the most misused statistics terms out there.

R-squared is defined as the proportion of variation in Y that can be explained by variation in X.

P-value is the traditional measure of statistical significance. The key takeaway here is that the p-value tells us how likely it would be to see a result like this if it were just random noise. By convention, if that likelihood is 5% or less, we call the result statistically significant.
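If you want to grab those two statistics programmatically instead of scanning the printed output, they live inside the summary object (a minimal sketch, again assuming the fit object from above):

s <- summary(fit)
s$r.squared                                  # share of the variation in price explained by sqft_living
s$coefficients["sqft_living", "Pr(>|t|)"]    # p-value on the sqft_living coefficient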

Another way to glance at some similar output is passing our model to the get_regression_table function in the moderndive package.

library(moderndive)
get_regression_table(fit)
[Figure: get_regression_table output]

get_regression_table serves as a quick wrapper around the model that conveniently displays some of its more important statistics.

Conclusion

Hopefully this proved to be a useful introduction to linear regression: how to build these models and how to interpret them.

Recap

Today we got a crash course in the following:

  • visualizing the relationship between a Y and an X
  • adding regression lines to our Y & X visualizations
  • building a linear regression model
  • evaluating said model through its statistical significance (p-value) and through the amount of variation in Y we can explain through variation in X.

If this was useful come check out the rest of my posts at datasciencelessons.com! As always, Happy Data Science-ing!

