Understanding Maximum Likelihood Estimation (MLE)


What Is It? And What Is It Used For?


The first time I learned MLE, I remember just thinking, “Huh?” It sounded more philosophical and idealistic than practical. But it turns out that MLE is actually quite practical and is a critical component of some widely used data science tools like logistic regression.

Let’s go over how MLE works and how we can use it to estimate the betas of a logistic regression model.

What Is MLE?

At its simplest, MLE is a method for estimating parameters. Every time we fit a statistical or machine learning model, we are estimating parameters. A single variable linear regression has the equation:

Y = B0 + B1*X

Our goal when we fit this model is to estimate the parameters B0 and B1 given our observed values of Y and X. We use Ordinary Least Squares (OLS), not MLE, to fit the linear regression model and estimate B0 and B1. But similar to OLS, MLE is a way to estimate the parameters of a model, given what we observe.

MLE asks the question, “Given the data that we observe (our sample), what are the model parameters that maximize the likelihood of the observed data occurring?”

A Simple Example

That’s quite a mouthful. Let’s use a simple example to show what we mean. Say we have a covered box containing an unknown number of red and black balls. If we randomly choose 10 balls from the box with replacement, and we end up with 9 black ones and only 1 red one, what does that tell us about the balls in the box?

Let’s say we start out believing there to be an equal number of red and black balls in the box, what’s the probability of observing what we observed?

Probability of drawing 9 black and 1 red (assuming 50% are black):

  • We can do this 10 possible ways (see picture below).
  • Each of the 10 ways has probability 0.5^10 ≈ 0.0977%.
  • Since there are 10 possible ways, we multiply by 10: the probability of 9 black and 1 red = 10 * 0.0977% ≈ 0.977%.


10 possible ways to draw 1 red ball and 9 black ones
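
For reference, the same number can be computed exactly with the binomial formula; here is a minimal check in Python (math.comb requires Python 3.8+):

from math import comb

# P(9 black, 1 red) = (number of orderings) * 0.5^9 * 0.5^1
p_black = 0.5
prob = comb(10, 9) * p_black**9 * (1 - p_black)**1
print(f'{prob:.5%}')  # ~0.97656%, i.e. roughly 0.977%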

We can confirm this with some code too (I always prefer simulating over calculating probabilities):

In:

import numpy as np

# Simulate drawing 10 balls 1,000,000 times to see how frequently
# we get 9 black ones
trials = [np.random.binomial(10, 0.5) for i in range(1000000)]
prob_9 = sum(1 for i in trials if i == 9) / len(trials)
print('Probability = ' + str(round(prob_9, 5) * 100) + '%')

Out:

Probability = 0.972%

The simulated probability is really close to our calculated probability (they’re not exact matches because the simulated probability has variance).

So our takeaway is that the likelihood of picking out as many black balls as we did, assuming that 50% of the balls in the box are black, is extremely low. Being reasonable folks, we would hypothesize that the percentage of balls that are black must not be 50%, but something higher. Then what’s the percentage?

This is where MLE comes in. Recall that MLE is a way for us to estimate parameters. The parameter in question is the percentage of balls in the box that are black colored.

MLE asks what this percentage should be in order to maximize the likelihood of observing what we observed (pulling 9 black balls and 1 red one from the box).

We can use Monte Carlo simulation to explore this. The following block of code loops through a range of probabilities (the percentage of balls in the box that are black). For each probability, we simulate drawing 10 balls 100,000 times in order to see how often we end up with 9 black ones and 1 red one.

import matplotlib.pyplot as plt

# For each candidate value of the percentage of balls in the box that
# are black, simulate drawing 10 balls from the box 100,000 times
sims = 100000
black_percent_list = [i / 100 for i in range(100)]
prob_of_9 = []

# Cycle through the different candidate probabilities
for p in black_percent_list:
    # Simulate drawing 10 balls 100,000 times to see how frequently
    # we get 9 black ones
    trials = [np.random.binomial(10, p) for i in range(sims)]
    prob_of_9.append(float(sum([1 for i in trials if i == 9])) / len(trials))

plt.subplots(figsize=(7, 5))
plt.plot(prob_of_9)
plt.xlabel('Percentage Black')
plt.ylabel('Probability of Drawing 9 Black, 1 Red')
plt.tight_layout()
plt.savefig('prob_of_9', dpi=150)
plt.show()

We end up with the following plot:


Probability of drawing 9 black balls and 1 red ball at each value of percentage black

See that peak? That’s what we’re looking for. The value of percentage black where the probability of drawing 9 black and 1 red ball is maximized is its maximum likelihood estimate: the estimate of our parameter (percentage black) that most conforms with what we observed.

So MLE is effectively performing the following:

  • Write a probability function that connects the probability of what we observed with the parameter that we are trying to estimate: we can write ours as P(9 black, 1 red | percentage black=b) — the probability of drawing 9 black and 1 red balls given that the percentage of balls in the box that are black is equal to b.
  • Then we find the value of b that maximizes P(9 black, 1 red | percentage black=b).

It’s hard to eyeball from the picture but the value of percentage black that maximizes the probability of observing what we did is 90%. Seems obvious right? And while this result seems obvious to a fault, the underlying fitting methodology that powers MLE is actually very powerful and versatile.
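
Rather than eyeballing the plot, we can also get the answer without simulation by evaluating the exact binomial probability over a grid of candidate values of b and taking the value that maximizes it. A minimal sketch:

from math import comb

def prob_9_black_1_red(b):
    # Exact probability of drawing 9 black and 1 red in 10 draws,
    # given that a fraction b of the balls in the box are black
    return comb(10, 9) * b**9 * (1 - b)**1

candidates = [i / 100 for i in range(101)]
best_b = max(candidates, key=prob_9_black_1_red)
print(best_b)  # 0.9 -- the maximum likelihood estimate of percentage black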

MLE and Logistic Regression

Now that we know what it is, let’s see how MLE is used to fit a logistic regression (if you need a refresher on logistic regression, check out my previous post here).

The outputs of a logistic regression are class probabilities. In my previous blog on it, the output was the probability of making a basketball shot. But our data comes in the form of 1s and 0s, not probabilities. For example, if I shot a basketball 10 times from varying distances, my Y variable, the outcome of each shot, would look something like (1 represents a made shot):

y = [0, 1, 0, 1, 1, 1, 0, 1, 1, 0]

And my X variable, the distance (in feet) from the basket of each shot, would look like:

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

How can we go from 1s and 0s to probabilities? We can think of each shot as the outcome of a binomially distributed random variable (for more on the binomial distribution, read my previous article here). In plain English, this means that each shot is its own trial (like a single coin toss) with some underlying probability of success. Except that we are not just estimating a single static probability of success; rather, we are estimating the probability of success conditional on how far we are from the basket when we shoot the ball.

So we can reframe our problem as a conditional probability (y = the outcome of the shot):

P(y | Distance from Basket)

In order to use MLE, we need some parameters to fit. In a single variable logistic regression, those parameters are the regression betas: B0 and B1. In the equation below, Z is the log odds of making a shot (if you don’t know what this means, it’s explained here).

Z = B0 + B1*X

You can think of B0 and B1 as hidden parameters that describe the relationship between distance and the probability of making a shot. For certain values of B0 and B1, there might be a strongly positive relationship between shooting accuracy and distance. For others, it might be weakly positive or even negative (Steph Curry). If B1 were set to 0, there would be no relationship at all:


The effect of different B0 and B1 parameters on probability
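
To make that mapping concrete, here is a minimal sketch of how the logistic (sigmoid) function turns the log odds Z = B0 + B1*X into a probability of making the shot at each distance. The particular B0 and B1 values below are just illustrative guesses, not fitted values:

import numpy as np

def shot_probability(X, B0, B1):
    # Convert log odds Z = B0 + B1*X into a probability via the sigmoid
    Z = B0 + B1 * np.asarray(X, dtype=float)
    return 1 / (1 + np.exp(-Z))

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Two illustrative parameter guesses: one with no distance effect (B1 = 0),
# one where accuracy drops off with distance (negative B1)
print(shot_probability(X, B0=0.5, B1=0.0))
print(shot_probability(X, B0=2.0, B1=-0.4))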

For each set of B0 and B1, we can use Monte Carlo simulation to figure out the probability of observing the data. The probability we are simulating for is the probability of observing our exact shot sequence (y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0], given that Distance from Basket=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) for a guessed set of B0, B1 values.

P(y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0] | Dist=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) for a given B0 and B1

By trying a bunch of different values, we can find the values for B0 and B1 that maximize P(y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0] | Dist=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]). Those would be the MLE estimates of B0 and B1.
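
In fact, for a candidate pair of B0, B1 values we don’t even need to simulate: each shot contributes p_i to the likelihood if it went in and (1 - p_i) if it missed, and the probability of the exact sequence is the product of those terms. A minimal brute-force sketch over a coarse grid (the grid bounds and step size are arbitrary choices for illustration):

import numpy as np

y = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0])
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)

def sequence_likelihood(B0, B1):
    # P(y | X, B0, B1): the product of the per-shot Bernoulli probabilities
    p = 1 / (1 + np.exp(-(B0 + B1 * X)))
    return np.prod(np.where(y == 1, p, 1 - p))

# Brute-force search over a coarse grid of candidate B0, B1 values
grid = np.arange(-3, 3.05, 0.1)
B0_hat, B1_hat = max(((b0, b1) for b0 in grid for b1 in grid),
                     key=lambda params: sequence_likelihood(*params))
print(B0_hat, B1_hat)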

Obviously, in logistic regression and with MLE in general, we’re not going to be brute force guessing. Rather, we create a cost function that is essentially the negative of the log of the probability that we are trying to maximize. This cost function shrinks as P(y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0] | Dist=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) grows, and like the likelihood, its value varies with our parameters B0 and B1. We can find the optimal values for B0 and B1 by using gradient descent to minimize this cost function.
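
As a minimal sketch of that last step, here is plain gradient descent on the negative log-likelihood for the toy shot data (the learning rate and iteration count are arbitrary choices, not values from the post):

import numpy as np

y = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0])
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)

B0, B1 = 0.0, 0.0          # start from arbitrary parameter values
learning_rate = 0.01

for _ in range(50000):
    p = 1 / (1 + np.exp(-(B0 + B1 * X)))   # predicted shot probabilities
    # Gradient of the negative log-likelihood with respect to B0 and B1
    grad_B0 = np.sum(p - y)
    grad_B1 = np.sum((p - y) * X)
    B0 -= learning_rate * grad_B0
    B1 -= learning_rate * grad_B1

print(B0, B1)  # the MLE estimates of the intercept and slope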

But in spirit, what we are doing as always with MLE, is asking and answering the following question:

Given the data that we observe, what are the model parameters that maximize the likelihood of the observed data occurring?

I referred to the following articles in this post:

Understanding Logistic Regression

The Binomial Distribution

