Understanding Maximum Likelihood Estimation (MLE)
What Is It? And What Is It Used For?
The first time I learned MLE, I remember just thinking, “Huh?” It sounded more philosophical and idealistic than practical. But it turns out that MLE is actually quite practical and is a critical component of some widely used data science tools like logistic regression.
Let’s go over how MLE works and how we can use it to estimate the betas of a logistic regression model.
What Is MLE?
At its simplest, MLE is a method for estimating parameters. Every time we fit a statistical or machine learning model, we are estimating parameters. A single variable linear regression has the equation:
Y = B0 + B1*X
Our goal when we fit this model is to estimate the parameters B0 and B1 given our observed values of Y and X. We use Ordinary Least Squares (OLS), not MLE, to fit the linear regression model and estimate B0 and B1. But similar to OLS, MLE is a way to estimate the parameters of a model, given what we observe.
MLE asks the question, “Given the data that we observe (our sample), what are the model parameters that maximize the likelihood of the observed data occurring?”
A Simple Example
That’s quite a mouthful. Let’s use a simple example to show what we mean. Say we have a covered box containing an unknown number of red and black balls. If we randomly choose 10 balls from the box with replacement, and we end up with 9 black ones and only 1 red one, what does that tell us about the balls in the box?
Let’s say we start out believing there to be an equal number of red and black balls in the box. What’s the probability of observing what we observed?
Probability of drawing 9 black and 1 red (assuming 50% are black):
- There are 10 possible orderings in which this can happen (the red ball can be the 1st draw, the 2nd, and so on up to the 10th).
- Each of the 10 orderings has probability 0.5^10 = 0.0977%.
- Since there are 10 possible orderings, we multiply by 10: probability of 9 black and 1 red = 10 * 0.0977% = 0.977%.
We can confirm this with some code too (I always prefer simulating over calculating probabilities):
In:
import numpy as np

# Simulate drawing 10 balls 1000000 times to see how frequently
# we get 9 black ones
trials = np.random.binomial(10, 0.5, size=1000000)
print('Probability = ' + str(round(float(np.mean(trials == 9)) * 100, 3)) + '%')

Out:
Probability = 0.972%
The simulated probability is really close to our calculated probability (they’re not exact matches because the simulated probability has variance).
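As a cross-check on the simulation, the same figure can be computed exactly from the binomial formula. This is a minimal sketch using only the Python standard library; the variable names are my own and do not come from the code above.

```python
from math import comb

n_draws = 10      # balls drawn with replacement
n_black = 9       # black balls observed
p_black = 0.5     # assumed share of black balls in the box

# Number of orderings with exactly 9 black balls out of 10 draws
n_ways = comb(n_draws, n_black)

# Probability of any one specific ordering: 0.5^9 * 0.5^1 = 0.5^10
p_one_ordering = p_black**n_black * (1 - p_black)**(n_draws - n_black)

exact_prob = n_ways * p_one_ordering
print(f"Probability = {exact_prob * 100:.3f}%")  # prints Probability = 0.977%
```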
So our takeaway is that the likelihood of picking out as many black balls as we did, assuming that 50% of the balls in the box are black, is extremely low. Being reasonable folks, we would hypothesize that the percentage of balls that are black must not be 50%, but something higher. Then what’s the percentage?
This is where MLE comes in. Recall that MLE is a way for us to estimate parameters. The parameter in question is the percentage of balls in the box that are black colored.
MLE asks what should this percentage be to maximize the likelihood of observing what we observed (pulling 9 black balls and 1 red one from the box).
We can use Monte Carlo simulation to explore this. The following block of code loops through a range of probabilities (the percentage of balls in the box that are black). For each probability, we simulate drawing 10 balls 100,000 times in order to see how often we end up with 9 black ones and 1 red one.
import numpy as np
import matplotlib.pyplot as plt

# Simulate drawing 10 balls from the box 100000 times for each
# candidate value of the percentage of balls that are black
sims = 100000
black_percent_list = [i/100 for i in range(100)]
prob_of_9 = []

# Cycle through the different probabilities
for p in black_percent_list:
    # Simulate drawing 10 balls 100000 times to see how frequently
    # we get 9
    trials = np.random.binomial(10, p, size=sims)
    prob_of_9.append(float(np.mean(trials == 9)))

plt.subplots(figsize=(7,5))
plt.plot([100 * p for p in black_percent_list], prob_of_9)
plt.xlabel('Percentage Black')
plt.ylabel('Probability of Drawing 9 Black, 1 Red')
plt.tight_layout()
# Save before show(); saving afterwards can write an empty figure
plt.savefig('prob_of_9', dpi=150)
plt.show()
We end up with the following plot:
See that peak? That’s what we’re looking for. The value of percentage black where the probability of drawing 9 black and 1 red ball is maximized is its maximum likelihood estimate: the estimate of our parameter (percentage black) that best fits what we observed.
So MLE is effectively performing the following:
- Write a probability function that connects the probability of what we observed with the parameter that we are trying to estimate. We can write ours as P(9 black, 1 red | percentage black=b): the probability of drawing 9 black and 1 red balls given that the percentage of balls in the box that are black is equal to b.
- Then we find the value of b that maximizes P(9 black, 1 red | percentage black=b).
It’s hard to eyeball from the picture, but the value of percentage black that maximizes the probability of observing what we did is 90%. Seems obvious, right? And while this result seems obvious to a fault, the underlying fitting methodology that powers MLE is actually very powerful and versatile.
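To pin down the peak without relying on a noisy simulation, we can evaluate the exact binomial likelihood on a grid of candidate values and pick the largest. This is a sketch under the same setup as above (10 draws, 9 black); the grid spacing of 0.01 and the names b_grid and b_mle are my own choices.

```python
import numpy as np
from math import comb

# Candidate values for b, the share of black balls (0.00, 0.01, ..., 1.00)
b_grid = np.linspace(0.0, 1.0, 101)

# Exact binomial likelihood of 9 black and 1 red in 10 draws for each b
likelihood = comb(10, 9) * b_grid**9 * (1 - b_grid)**1

# The maximum likelihood estimate is the b with the highest likelihood
b_mle = b_grid[np.argmax(likelihood)]
print(f"MLE of percentage black = {b_mle:.2f}")  # prints MLE of percentage black = 0.90
```

The same answer falls out of a little calculus: the log of the likelihood is 9*log(b) + log(1-b) plus a constant, and setting its derivative 9/b - 1/(1-b) to zero gives b = 0.9.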
MLE and Logistic Regression
Now that we know what it is, let’s see how MLE is used to fit a logistic regression (if you need a refresher on logistic regression, check out my previous post here).
The outputs of a logistic regression are class probabilities. In my previous blog on it, the output was the probability of making a basketball shot. But our data comes in the form of 1s and 0s, not probabilities. For example, if I shot a basketball 10 times from varying distances, my Y variable, the outcome of each shot, would look something like (1 represents a made shot):
y = [0, 1, 0, 1, 1, 1, 0, 1, 1, 0]
And my X variable, the distance (in feet) from the basket of each shot, would look like:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
How can we go from 1s and 0s to probabilities? We can think of each shot as the outcome of a binomially distributed random variable with a single trial, also known as a Bernoulli random variable (for more on the binomial distribution, read my previous article here). In plain English, this means that each shot is its own trial (like a single coin toss) with some underlying probability of success. Except that we are not just estimating a single static probability of success; rather, we are estimating the probability of success conditional on how far we are from the basket when we shoot the ball.
So we can reframe our problem as a conditional probability (y = the outcome of the shot):
P(y | Distance from Basket)
In order to use MLE, we need some parameters to fit. In a single variable logistic regression, those parameters are the regression betas: B0 and B1. In the equation below, Z is the log odds of making a shot (if you don’t know what this means, it’s explained here).
Z = B0 + B1*X
You can think of B0 and B1 as hidden parameters that describe the relationship between distance and the probability of making a shot. For certain values of B0 and B1, there might be a strongly positive relationship between shooting accuracy and distance. For others, it might be weakly positive or even negative (Steph Curry). If B1 were set to equal 0, then there would be no relationship at all.
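To make this concrete, here is a small sketch of how a guessed pair of betas turns distances into shot probabilities, using the standard logistic function to convert log odds into a probability. The function name shot_probability and the three beta pairs are hypothetical, picked only for illustration.

```python
import numpy as np

def shot_probability(distance, b0, b1):
    """Convert log odds Z = B0 + B1*X into a probability with the logistic function."""
    z = b0 + b1 * distance
    return 1.0 / (1.0 + np.exp(-z))

distances = np.arange(1, 11)  # 1 to 10 feet, the same distances used in X

# Hypothetical beta values chosen only for illustration:
# a negative B1, a zero B1 (no relationship), and a positive B1
for b0, b1 in [(2.0, -0.3), (0.0, 0.0), (-2.0, 0.3)]:
    probs = shot_probability(distances, b0, b1)
    print(f"B0={b0:+.1f}, B1={b1:+.1f}:", np.round(probs, 2))
```

With B1 equal to 0, every distance gets the same probability, which is the "no relationship" case.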
For each set of B0 and B1, we can use Monte Carlo simulation to figure out the probability of observing the data. The probability we are simulating for is the probability of observing our exact shot sequence (y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0], given that Distance from Basket=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) for a guessed set of B0, B1 values.
P(y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0] | Dist=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) for a given B0 and B1
By trying a bunch of different values, we can find the values for B0 and B1 that maximize P(y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0] | Dist=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]). Those would be the MLE estimates of B0 and B1.
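The brute-force idea can be sketched in a few lines. Instead of simulating, this version computes the probability of the exact shot sequence directly, assuming the shots are independent given distance, and then tries every pair of betas on a grid. The grid ranges and step sizes are arbitrary assumptions of mine, so the result is only as precise as the grid.

```python
import numpy as np

y = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0])   # shot outcomes from above
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # distances in feet from above

def sequence_likelihood(b0, b1):
    """P(y | X) for given betas, treating the shots as independent."""
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * X)))    # P(make) at each distance
    # Use p for made shots and 1 - p for misses, then multiply across shots
    return np.prod(np.where(y == 1, p, 1.0 - p))

# Hypothetical search grid; the ranges and step sizes are arbitrary choices
b0_grid = np.linspace(-3.0, 3.0, 61)
b1_grid = np.linspace(-1.0, 1.0, 81)

# Evaluate every (B0, B1) pair and keep the one with the highest likelihood
best_b0, best_b1, best_like = max(
    ((b0, b1, sequence_likelihood(b0, b1)) for b0 in b0_grid for b1 in b1_grid),
    key=lambda t: t[2],
)
print(f"Best grid point: B0={best_b0:.2f}, B1={best_b1:.3f}, likelihood={best_like:.6f}")
```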
Obviously in logistic regression and with MLE in general, we’re not going to be brute force guessing. Rather, we create a cost function that is basically an inverted form of the probability that we are trying to maximize. In logistic regression this cost function is typically the negative log of P(y=[0, 1, 0, 1, 1, 1, 0, 1, 1, 0] | Dist=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), so it falls as that probability rises, and like the probability, its value varies with our parameters B0 and B1. We can find the optimal values for B0 and B1 by using gradient descent to minimize this cost function.
But in spirit, what we are doing, as always with MLE, is asking and answering the following question:
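Here is a minimal sketch of that idea, assuming the cost is the average negative log-likelihood (log loss) and using plain gradient descent on the ten shots above. The learning rate, iteration count, and starting values are my own assumptions, and real libraries generally use more sophisticated optimizers.

```python
import numpy as np

y = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0], dtype=float)   # shot outcomes
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)  # distances in feet

def predict(b0, b1):
    """Predicted probability of making each shot for the given betas."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * X)))

def cost(b0, b1):
    """Average negative log-likelihood (log loss) of the observed shots."""
    p = predict(b0, b1)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

b0, b1 = 0.0, 0.0          # starting guess
learning_rate = 0.05       # assumed step size, small enough to be stable here
for _ in range(20000):
    error = predict(b0, b1) - y               # predicted probability minus outcome
    b0 -= learning_rate * np.mean(error)      # partial derivative of cost w.r.t. B0
    b1 -= learning_rate * np.mean(error * X)  # partial derivative of cost w.r.t. B1

print(f"B0={b0:.3f}, B1={b1:.3f}, cost={cost(b0, b1):.4f}")
```

Minimizing this cost and maximizing the likelihood pick out the same B0 and B1, because the negative log only flips and rescales the likelihood without moving the location of its peak.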
Given the data that we observe, what are the model parameters that maximize the likelihood of the observed data occurring?