From Zero to Linear Regression

3 Jun 2020 - 1490 Words

What is linear regression?

If you’re interested read on, if you’re not, see yourself out. :door:

This article is the first of a series covering fundamental machine learning algorithms. Each post will be split into two parts:

  1. The idea and key concepts - Most people should be able to follow this section and learn how the algorithm works
  2. The nitty gritty - This is for the interested reader and will include detailed mathematical derivations followed by an implementation in Python :snake:

The idea and key concepts

Regression is any algorithm that takes a collection of inputs and predicts a continuous output. I’ll forgive you if you still don’t follow.

Let’s give an example and this should make much more sense. Let’s say we are trying to predict how much a house :house_with_garden: will sell for in a desirable area of North London. We may know a few facts about the house e.g. the square footage of the house :house:, the square footage of the garden :deciduous_tree: and whether there is a garage :blue_car:.

In machine learning we call these facts variables or features . Intuitively we may value the house using a combination of these features. :house::deciduous_tree::blue_car:

Linear regression uses a linear combination of the features to predict the output. This just means summing up each feature value multiplied by a number (called a coefficient) that represents how important that feature is, e.g.

$$ \begin{aligned} \text{House Price} &= (£1000 * \text{:house:}) \\ & + (£50 * \text{:deciduous_tree:}) \\ & + (£1000 * \text{:blue_car:?}) \end{aligned} $$

In this example the house price is calculated as $£1000$ for each square foot of the house plus $£50$ for each square foot of the garden plus $£1000$ if there is a garage. If we took a nice $600$ square foot property with a small $500$ square foot garden and no garage our linear regression model would predict the house is worth $£625,000$.

$$ \begin{aligned} \text{House Price} &= (£1000 * \text{:house:}) + (£50 * \text{:deciduous_tree:}) + (£1000 * \text{:blue_car: ?}) \\ &= (£1000 * 600) + (£50 * 500) + (£1000 * 0)\\ &= £625,000 \end{aligned} $$
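To make the arithmetic concrete, here is a tiny sketch in plain Python using the illustrative coefficients and the example property above:

# Illustrative coefficients from the example above
price_per_house_sqft = 1000   # £ per square foot of house
price_per_garden_sqft = 50    # £ per square foot of garden
price_if_garage = 1000        # £ added if there is a garage

# The example property: 600 sq ft house, 500 sq ft garden, no garage
house_sqft, garden_sqft, has_garage = 600, 500, 0

house_price = (price_per_house_sqft * house_sqft
               + price_per_garden_sqft * garden_sqft
               + price_if_garage * has_garage)
print(f'£{house_price:,}')  # £625,000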

Now that’s how the linear regression model works! The question now is how do you choose the coefficients for the features, in our example the $1000$, the $50$ and the $1000$?

A common class of machine learning algorithms, called supervised learning algorithms, finds these parameters using training data. Training data is just examples where you already know the answer. In our case it is a list of houses where we know both the house features :house::deciduous_tree::blue_car: and the house prices :moneybag: in advance. Here are three training examples along with the model predictions:

| House size (sq ft) :house: | Garden size (sq ft) :deciduous_tree: | Garage? :blue_car: | True house price :moneybag: | Predicted house price :moneybag: |
| --- | --- | --- | --- | --- |
| 1000 | 700 | Garage | £1m | £1.036m |
| 770 | 580 | No garage | £0.75m | £0.799m |
| 660 | 200 | Garage | £0.72m | £0.671m |
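As a quick sanity check, the short sketch below (using NumPy and the same illustrative coefficients as before) reproduces the predicted-price column of this table:

import numpy as np

# One row per training example: house sq ft, garden sq ft, garage (1 = yes, 0 = no)
features = np.array([
    [1000, 700, 1],
    [770, 580, 0],
    [660, 200, 1],
])
coefficients = np.array([1000, 50, 1000])  # the illustrative coefficients from earlier

predictions = features @ coefficients
print(predictions)  # [1036000  799000  671000], i.e. £1.036m, £0.799m and £0.671m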

Linear regression is a supervised learning algorithm, which means it works out the best coefficients using the training data. It does this by iteratively changing the coefficients to reduce the error between the predictions and the true house prices. The specifics of this process are more involved and are detailed in the next section.

In short the linear regression algorithm chooses the coefficients to minimise the average error when predicting the house price for all the training examples.

And that’s linear regression!

The nitty gritty

In order to keep this as accessible as possible to people less familiar with mathematical concepts and notation I will include footnotes to explain where I think it may help.

The model

Let $y$ represent a single continuous target variable and $x_1,\dots,x_n$ (where $n \in \mathbb{N}$ and $x_0 = 1$) represent the feature values. Then the linear model can be written as

$$ \begin{align} \hat{y}&=\beta_0x_0+\cdots+\beta_nx_n \\ \hat{y}&=\sum^{n}_{i=0}\beta_ix_i \end{align} $$

A hat above a variable is often used to represent a prediction of the true value, so here $\hat{y}$ represents the linear model’s prediction.

The cost function

We define below the cost (a.k.a. error or loss ) function $J$ as half of the mean square error for the $m$ training samples where $m \in \mathbb{N}$. We use a superscript to represent each training example so $y^j$ is the $j$th training data target value and $x_i^j$ is the $i$th feature value of the $j$th training example.

$$ \begin{align} J(\boldsymbol{\beta}) &= \frac{1}{2m}\sum^{m}_{j=1}\left( y^j-\hat{y}^j \right)^2\\ &= \frac{1}{2m}\sum^{m}_{j=1}\left( y^j-\sum^{n}_{i=0}\beta_ix_i^j \right)^2 \end{align} $$

Gradient descent

In order to find the coefficients that lead to the minimal cost we use gradient descent . The gradient of the cost function tells you in which direction to change your coefficients in order to reduce the value of the cost function the most. The gradient is defined as the vector of partial derivatives with respect to each coefficient so let’s first calculate these:

$$ \begin{align} \frac{\partial J}{\partial\beta_k}\left(\boldsymbol{\beta}\right) &= \frac{\partial}{\partial\beta_k}\left( \frac{1}{2m}\sum^{m}_{j=1} \left( y^j-\sum^{n}_{i=0}\beta_ix_i^j \right)^2 \right)\\ &= \frac{1}{m}\sum^{m}_{j=1} \left( y^j-\sum^{n}_{i=0}\beta_ix_i^j \right)(-x^j_k)\\ \end{align} $$

Now we can write the gradient as:

$$ \begin{aligned} \nabla_{\boldsymbol{\beta}} J &= \begin{bmatrix} \frac{\partial J}{\partial\beta_0} \\ \vdots \\ \frac{\partial J}{\partial\beta_n} \end{bmatrix} \\ &= \begin{bmatrix} -\frac{1}{m}\sum^{m}_{j=1} \left(y^j-\sum^{n}_{i=0}\beta_ix_i^j\right)x^j_0\\ \vdots \\ -\frac{1}{m}\sum^{m}_{j=1} \left(y^j-\sum^{n}_{i=0}\beta_ix_i^j\right)x^j_n\\ \end{bmatrix} \end{aligned} $$

We could calculate the above gradient using the sums as written, but it is more efficient for the implementation if we vectorise the calculation.

Vectorise

For this we define the design matrix $\mathbf{X}$ by stacking the $m$ training examples on top of each other, so each row of $\mathbf{X}$ represents one training example and the columns represent the different features. We also define $\mathbf{y}$, the vector of target values, by stacking the $m$ target variables on top of each other. Finally we define the vector of $n+1$ coefficients $\boldsymbol{\beta}$, where:

$$ \mathbf{X}\in\mathbb{R}^{m\times (n+1)}, \quad \mathbf{y}\in\mathbb{R}^{m\times 1}, \quad \boldsymbol{\beta}\in\mathbb{R}^{(n+1)\times1} $$

Or more visually

$$ \mathbf{X}=\begin{bmatrix} 1 & x_1^1 & x_2^1 & \dots & x_n^1 \\ 1 & x_1^2 & x_2^2 & \dots & x_n^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^m & x_2^m & \dots & x_n^m \\ \end{bmatrix} $$

and

$$ \mathbf{y}=\begin{bmatrix} y^1\\y^2\\\vdots\\y^m \end{bmatrix} \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0\\\beta_1\\\vdots\\\beta_n \end{bmatrix} $$
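In code, building the design matrix simply means prepending a column of ones (the $x_0 = 1$ term) to the raw feature matrix so that $\beta_0$ can act as an intercept. Here is a minimal NumPy sketch, reusing the made-up house data from the table above:

import numpy as np

# Raw features: one row per training example (house sq ft, garden sq ft, garage flag)
X_raw = np.array([
    [1000, 700, 1],
    [770, 580, 0],
    [660, 200, 1],
])
m = X_raw.shape[0]

# Design matrix: a column of ones followed by the raw feature columns
X = np.hstack([np.ones((m, 1)), X_raw])
print(X.shape)  # (3, 4), i.e. m rows by (n + 1) columns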

Now if we take the gradient we derived above and re-write it, splitting the terms, we see

$$ \nabla_{\boldsymbol{\beta}} J =-\frac{1}{m} \begin{bmatrix} \sum^{m}_{j=1}y^jx^j_0\\ \vdots \\ \sum^{m}_{j=1}y^jx^j_n\\ \end{bmatrix}+ \frac{1}{m} \begin{bmatrix} \sum^{m}_{j=1}\sum^{n}_{i=0}\beta_ix_i^jx^j_0\\ \vdots \\ \sum^{m}_{j=1}\sum^{n}_{i=0}\beta_ix_i^jx^j_n \end{bmatrix}\\ $$

From this (I hope you can convince yourself, assuming you know matrix multiplication) we can write

$$ \begin{align} \nabla_{\boldsymbol{\beta}} J &=\frac{1}{m}\left( \mathbf{X}^T\mathbf{X}\mathbf{\boldsymbol{\beta}}-\mathbf{X}^T\mathbf{y} \right)\\ &=\frac{1}{m}\mathbf{X}^T\left( \mathbf{X}\mathbf{\boldsymbol{\beta}}-\mathbf{y} \right)\\ &=\frac{1}{m}\mathbf{X}^T\left( \mathbf{\hat{y}}-\mathbf{y} \right) \end{align} $$

Where $\mathbf{\hat{y}} = \mathbf{X}\mathbf{\boldsymbol{\beta}}$ are the predictions of the linear model.
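If you would rather convince yourself numerically than algebraically, the following sketch (with made-up random data) compares the gradient written out as explicit sums with the vectorised $\frac{1}{m}\mathbf{X}^T(\mathbf{\hat{y}}-\mathbf{y})$ form:

import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3

# Made-up data purely to check the algebra numerically
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # design matrix with x_0 = 1
y = rng.normal(size=(m, 1))
beta = rng.normal(size=(n + 1, 1))
y_pred = X @ beta

# Partial derivatives written out as explicit sums, one per coefficient
grad_sums = np.array([
    [-(1 / m) * np.sum((y - y_pred).ravel() * X[:, k])]
    for k in range(n + 1)
])

# Vectorised form derived above
grad_vec = (1 / m) * X.T @ (y_pred - y)

print(np.allclose(grad_sums, grad_vec))  # True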

Now that we have derived the gradient formula :tada: let’s implement gradient descent in python :snake: to iteratively step towards the optimal coefficients.

Python implementation

We will build the implementation in an object-oriented fashion, defining a class for linear regression. We’ll use NumPy throughout, so we import it first. For the full code (with docstrings) see GitHub here.

import numpy as np

class LinearRegression():

Next we define the __init__ method on the class, setting the learning rate. Remember the gradient tells you in which direction to change the coefficients. The gradient descent algorithm repeatedly updates the coefficients by stepping in the direction of the negative gradient. The size of the step is governed by the learning rate.

def __init__(self, learning_rate=0.05):
    self.learning_rate = learning_rate
    print('Creating linear model instance')
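For reference, each gradient descent step that the fit method will later perform updates the coefficients as

$$ \boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \alpha \, \nabla_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) $$

where $\alpha$ is the learning rate we just set.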

Next let’s define a method for the cost function as defined above

def cost(self, y, y_pred):
    m = y.shape[0]
    cost = (1 / (2 * m)) * \
        (y - y_pred).T @ (y - y_pred)
    return cost

Next let’s define a method to calculate the gradient of the cost function

def gradient(self, y, y_pred, X):
    m = X.shape[0]
    gradient = (1 / m) * \
         X.T @ (y_pred - y)
    return gradient

Before we define the fit method to implement gradient descent, let’s define one last method which predicts the target variable $y$ given the current coefficients and the features $X$.

def predict(self, X):
    y_pred = X @ self.beta
    return y_pred

And finally here is the fit method implementing n_iter iterations of gradient descent.

def fit(self, X, y, n_iter=1000):
    m, n = X.shape
    print(f'fitting with m={m} samples with n={n-1} features\n')
    self.beta = np.zeros(shape=(n, 1))
    self.costs = []
    self.betas = [self.beta]
    for iteration in range(n_iter):
        y_pred = self.predict(X)
        cost = self.cost(y, y_pred)
        self.costs.append(cost[0][0])
        gradient = self.gradient(y, y_pred, X)
        self.beta = self.beta - (
            self.learning_rate * gradient)
        self.betas.append(self.beta)

And that’s it. Here’s an example use of the class:

linear_regression = LinearRegression()
linear_regression.fit(X, y)

linear_regression.predict(X_new)
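If you’d like to try it end to end, here is a minimal sketch with made-up synthetic data (note the methods above assume $X$ already contains the column of ones and that $y$ is a column vector):

import numpy as np

rng = np.random.default_rng(42)
m = 100

# Synthetic data generated from y = 3 + 2 * x_1 plus a little noise
x1 = rng.uniform(0, 10, size=(m, 1))
X = np.hstack([np.ones((m, 1)), x1])
y = 3 + 2 * x1 + rng.normal(scale=0.5, size=(m, 1))

linear_regression = LinearRegression(learning_rate=0.01)
linear_regression.fit(X, y, n_iter=5000)
print(linear_regression.beta.ravel())  # roughly [3, 2]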

Thanks for reading! :clap: Please get in touch with any questions, mistakes or improvements.

