OPTIMIZATION AND MACHINE LEARNING
What to Do When Your Model Has a Non-Normal Error Distribution
How to use warping to fit arbitrary error distributions
One of the most important things a model can tell us is how certain it is about a prediction. An answer to this question can come in the form of an error distribution: a probability distribution around a point prediction that tells us how likely each error delta is.
The error distribution can be every bit as important as the point prediction.
Suppose you’re an investor considering two different opportunities (A and B) and using a model to predict the one-year returns (as a percentage of the amount invested). The model predicts A and B to have the same expected one-year return of 10% but produces very different error distributions for the two.
Even though both opportunities have the same expected return, the error distributions show how different they are. B is tightly distributed about its expected value with little risk of losing money, whereas A is more like a lottery: there’s a small probability of a high payout (~500% return), but most of the time we’ll lose everything (~−100% return).
A point prediction tells us nothing about how target values are likely to be distributed around it. If it’s important to know how far off a prediction can be, or whether target values have fat tails, then an accurate error distribution becomes essential.
An easy way to get the error distribution wrong is to try to force it into a form it doesn’t take. This frequently happens when we reach for the convenient, but often misapplied, normal distribution.
The normal distribution is popular for good reason. In addition to making math easier, the central limit theorem tells us the normal distribution can be a natural choice for many problems.
How can a normal distribution come about naturally?
Let X denote a feature matrix and b denote a vector of regressors. Suppose target values are generated by the equation
y = Xb + e,
where each error term breaks down into a sum of m contributions,
eᵢ = Eᵢ₁ + Eᵢ₂ + ⋯ + Eᵢₘ.
The central limit theorem says that if the Eᵢₖ’s are independent, identically distributed random variables with finite variance, then the sum (suitably normalized) will approach a normal distribution as m increases.
Even when E is wildly non-normal, e will be close to normal if the summation contains enough terms.
Let’s look at a concrete example. Set b = (−2, 3). Let the entries of X be generated independently from the uniform distribution on [−1, 1]. We’ll generate the Eᵢₖ’s from a decidedly non-normal distribution.
We normalize the error distribution for e to have unit variance and allow the number of terms m to vary. Here are histograms of the errors (in orange) from least squares models taken from runs of a simulation for different values of m and overlaid with the expected histogram of the errors if they were normally distributed (in blue)¹.
With greater values of m, the error histogram gets increasingly closer to that of the normal distribution.
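Here’s a minimal simulation sketch of this effect. The specific non-normal distribution used for the E’s above isn’t reproduced here; a centered exponential distribution, which is heavily skewed, stands in for it.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_errors(m, n=100_000):
    """Sum m iid non-normal terms per observation, scaled to unit variance."""
    # Stand-in non-normal distribution: a centered exponential (skewed).
    E = rng.exponential(scale=1.0, size=(n, m)) - 1.0
    return E.sum(axis=1) / np.sqrt(m)  # Var(exponential(1)) = 1

for m in (1, 5, 50):
    e = simulated_errors(m)
    skewness = np.mean(e**3) / np.mean(e**2) ** 1.5
    print(f"m={m:3d}  skewness={skewness:+.3f}  (0 for a normal distribution)")
```

As m grows, the skewness shrinks toward zero, the value it takes for a normal distribution.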
When there’s reason to think that error terms break down into sums of independent, identically distributed factors like this, the normal distribution is a good choice. But in the general case, we have no reason to assume it. And indeed, many error distributions are not normal, exhibiting skew and fat tails.
What should we do when we have non-normality in an error distribution?
This is where warping helps us². It uses the normal distribution as a building block but gives us knobs to locally adjust the distribution to better fit the errors from the data.
To see how warping works, observe that if f(y) is a monotonically increasing surjective function and p(z) is a probability density function, then p(f(y))·f′(y) forms a new probability density function. It’s nonnegative because f′(y) ≥ 0, and after applying the substitution u = f(y), we see that
∫ p(f(y)) f′(y) dy = ∫ p(u) du = 1.
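As a quick numerical sanity check, here’s a sketch using an assumed warp f(y) = y + tanh(y): the warped density is nonnegative and still integrates to one when p is the standard normal density.

```python
import numpy as np
from scipy import stats

# An assumed monotonically increasing, surjective warp and its derivative.
f = lambda y: y + np.tanh(y)
f_prime = lambda y: 1.0 + 1.0 / np.cosh(y) ** 2

y = np.linspace(-12.0, 12.0, 400_001)
warped_pdf = stats.norm.pdf(f(y)) * f_prime(y)

# The warped density is nonnegative and integrates to ~1.
print(warped_pdf.min() >= 0.0)
print(np.sum(warped_pdf) * (y[1] - y[0]))
```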
Let’s look at an example to see how f can reshape a distribution. Suppose p(z) is the standard normal distribution N(0, 1) and f(y) is a function that equals y below 0 and c·y above 1, where c > 0; between [0, 1], f is a spline that smoothly transitions between y and c·y. Here’s what f looks like for a few different values of c,
and here’s what the resulting warped probability distributions look like³.
When c = 2 , area is redistributed from the standard normal distribution so that the probability density function (PDF) peaks and then quickly falls off so as to have a thinner right tail. When c = 0.5 , the opposite happens: the PDF falls off quickly and then slows its rate of decline so as to have a fatter right tail.
Now, imagine f is parameterized by a vector ψ that allows us to make arbitrary localized adjustments to the rate of increase. (More on how to parameterize f later). Then with suitable ψ , f can fit a wide range of different distributions. If we can find a way to properly adjust ψ , then this will give us a powerful tool to fit error distributions.
How to adjust warping parameters?
A better-fitting error distribution makes the errors on the training data more likely. It follows that we can find the warping parameters by maximizing likelihood on the training data.
First, let’s look at how maximizing likelihood works without warping.
Let θ denote the parameter vector for a given regression model, and let g(x; θ) represent the prediction of the model for feature vector x. If we use a normal distribution with standard deviation σ to model the error distribution of predictions, then the likelihood of the training data is
∏ᵢ (1/√(2πσ²)) exp(−(yᵢ − g(xᵢ; θ))² / (2σ²)),
and the log-likelihood is
−(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ (yᵢ − g(xᵢ; θ))².
Put
RSS(θ) = Σᵢ (yᵢ − g(xᵢ; θ))²
(RSS stands for residual sum of squares ).
For θ fixed, σ maximizes the likelihood when
σ² = RSS(θ)/n.
More generally, if σ² = c·RSS ( c > 0 ), then the log-likelihood simplifies to
−(n/2) log RSS(θ) + constant.
And we see that the likelihood is maximized when θ minimizes RSS.
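Here’s a small numerical check of this fact, using an assumed set of residuals in place of a fitted model:

```python
import numpy as np

def normal_log_likelihood(residuals, sigma2):
    """Log-likelihood of residuals under a N(0, sigma2) error distribution."""
    n = residuals.size
    rss = np.sum(residuals ** 2)
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - rss / (2.0 * sigma2)

rng = np.random.default_rng(1)
residuals = rng.normal(scale=2.0, size=500)  # assumed residuals y_i - g(x_i; theta)

rss = np.sum(residuals ** 2)
sigma2_mle = rss / residuals.size  # the maximizing value for fixed theta

# The MLE value of sigma^2 beats nearby alternatives.
for s2 in (0.5 * sigma2_mle, sigma2_mle, 2.0 * sigma2_mle):
    print(f"sigma^2 = {s2:6.3f}  log-likelihood = {normal_log_likelihood(residuals, s2):9.2f}")
```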
Now, suppose we warp the target space with the monotonic function f parameterized by ψ, and let f(y; ψ) denote a warped target value. Then the likelihood with the warped error distribution is
∏ᵢ (f′(yᵢ; ψ)/√(2πσ²)) exp(−(f(yᵢ; ψ) − g(xᵢ; θ))² / (2σ²)),
and the log-likelihood becomes
Σᵢ log f′(yᵢ; ψ) − (n/2) log(2πσ²) − (1/(2σ²)) Σᵢ (f(yᵢ; ψ) − g(xᵢ; θ))².
Or, with
RSS(θ, ψ) = Σᵢ (f(yᵢ; ψ) − g(xᵢ; θ))²
and σ² = c·RSS, it simplifies to
Σᵢ log f′(yᵢ; ψ) − (n/2) log RSS(θ, ψ) + constant.
To fit the error distribution, we’ll use an optimizer to find the parameters (θ, ψ) that maximize this likelihood.
For an optimizer to work, it requires a local approximation to the objective that it can use to iteratively improve the parameters. To build such an approximation, we’ll need to compute the gradient of the log-likelihood with respect to the parameter vector.
Put
L(θ, ψ) = Σᵢ log f′(yᵢ; ψ) − (n/2) log RSS(θ, ψ).
We can use L as a proxy for the log-likelihood since it differs from it only by a constant.
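In code, the proxy is a one-liner once we have the warped targets, the model’s predictions, and the warp’s derivative at the training targets (a sketch with illustrative names):

```python
import numpy as np

def log_likelihood_proxy(z, predictions, f_prime_at_y):
    """L(theta, psi) = sum_i log f'(y_i; psi) - (n/2) * log RSS(theta, psi),
    where z_i = f(y_i; psi) are the warped targets."""
    n = z.size
    rss = np.sum((z - predictions) ** 2)
    return np.sum(np.log(f_prime_at_y)) - 0.5 * n * np.log(rss)
```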
Warping is a general process that can be applied to any base regression model, but we’ll focus on the simplest base model, linear regression.
How to warp a linear regression model?
With linear regression, we can derive a closed form for θ. Let Q and R be the matrices of the QR-factorization of the feature matrix X,
X = QR,
where Q is orthogonal and R is rectangular triangular. Put
zᵢ = f(yᵢ; ψ),
and let b̂ denote the vector that minimizes RSS for the warped targets z,
b̂ = argmin_b ‖z − Xb‖².
Put
z′ = Qᵀz.
Then, because Q is orthogonal,
‖z − Xb‖² = ‖Qᵀ(z − Xb)‖² = ‖z′ − Rb‖².
If X has m linearly independent columns, then the first m rows of the rectangular triangular matrix R have non-zero entries on the diagonal and the remaining rows are 0. It follows that
(Rb̂)ᵢ = z′ᵢ
for i ≤ m and
(Rb̂)ᵢ = 0
for i > m. Therefore,
RSS(b̂) = Σ_{i>m} (z′ᵢ)².
Let P be the n × n diagonal matrix with
Pᵢᵢ = 0 for i ≤ m and Pᵢᵢ = 1 for i > m.
Then
RSS(b̂) = ‖PQᵀz‖² = zᵀQPQᵀz.
Substituting these equations into the log-likelihood proxy, we get
L(ψ) = Σᵢ log f′(yᵢ; ψ) − (n/2) log(zᵀQPQᵀz).
And differentiating with respect to a warping parameter ψⱼ gives us
∂L/∂ψⱼ = Σᵢ ∂ log f′(yᵢ; ψ)/∂ψⱼ − n·(zᵀQPQᵀ(∂z/∂ψⱼ)) / (zᵀQPQᵀz),
where ∂z/∂ψⱼ is the vector with entries ∂f(yᵢ; ψ)/∂ψⱼ.
Using these derivatives, an optimizer can climb to warping parameters ψ that maximize the likelihood of the training data.
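Here’s a sketch that puts these pieces together for a linear model. The warp is a single tanh step (a simplified stand-in for the family described later in the article), the data is a toy example, and the gradient is left to a derivative-free optimizer for brevity; none of the names here come from a particular library.

```python
import numpy as np
from scipy.optimize import minimize

def warp(y, psi):
    """Single tanh-step warp (illustrative): f(y; psi) = y + a*tanh(b*(y + c))."""
    a, b, c = np.exp(psi[0]), np.exp(psi[1]), psi[2]  # exp keeps a, b > 0
    return y + a * np.tanh(b * (y + c))

def warp_prime(y, psi):
    a, b, c = np.exp(psi[0]), np.exp(psi[1]), psi[2]
    return 1.0 + a * b / np.cosh(b * (y + c)) ** 2

def neg_log_likelihood_proxy(psi, Q, y, m):
    """-L(psi), with RSS computed as the squared norm of the last n - m
    coordinates of Q^T z (i.e., z^T Q P Q^T z)."""
    z = warp(y, psi)
    rss = np.sum((Q.T @ z)[m:] ** 2)
    return -(np.sum(np.log(warp_prime(y, psi))) - 0.5 * y.size * np.log(rss))

# Toy data (assumed): a linear signal with skewed, non-normal noise.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(300), rng.uniform(-1, 1, size=300)])
y = X @ np.array([1.0, 2.0]) + rng.exponential(scale=1.0, size=300)

Q, _ = np.linalg.qr(X, mode="complete")   # full n-by-n orthogonal Q
m = X.shape[1]
psi_hat = minimize(neg_log_likelihood_proxy, x0=np.zeros(3),
                   args=(Q, y, m), method="Nelder-Mead").x

# With psi fixed, the regressors come from least squares on the warped targets.
b_hat, *_ = np.linalg.lstsq(X, warp(y, psi_hat), rcond=None)
print(psi_hat, b_hat)
```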
How to make predictions with a warped linear regression model?
Now that we’ve found warping parameters, we need to make predictions.
Consider how this works in a standard ordinary least squares model without warping. Suppose data is generated from the model
y = xᵀb + ε,
where ε ~ N(0, σ²). Let X and y denote the training data. The regressors that minimize the RSS of the training data are
b̂ = (XᵀX)⁻¹Xᵀy.
If x′ and y′ denote an out-of-sample feature vector and target value,
y′ = x′ᵀb + ε′,
then the error of the out-of-sample prediction is
e′ = y′ − x′ᵀb̂.
Because ε and ε′ are normally distributed, it follows that e′ is normally distributed, and its variance is⁴
Var(e′) = σ²(1 + x′ᵀ(XᵀX)⁻¹x′).
We rarely know the noise variance σ², but we can obtain an unbiased estimate of it from the training residuals,
σ̂² = RSS / (n − p),
where p is the number of regressors.
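A small simulation sketch (on assumed toy data) can check both formulas: the variance of the out-of-sample error and the unbiasedness of RSS/(n − p).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 3, 1.5
b = np.array([0.5, -2.0, 3.0])
X = rng.uniform(-1, 1, size=(n, p))
x_new = rng.uniform(-1, 1, size=p)

predicted_var = sigma**2 * (1.0 + x_new @ np.linalg.inv(X.T @ X) @ x_new)

errors, sigma2_estimates = [], []
for _ in range(20_000):
    y = X @ b + rng.normal(scale=sigma, size=n)
    b_hat, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_new = x_new @ b + rng.normal(scale=sigma)
    errors.append(y_new - x_new @ b_hat)
    sigma2_estimates.append(rss[0] / (n - p))

print(np.var(errors), predicted_var)          # close to each other
print(np.mean(sigma2_estimates), sigma**2)    # RSS/(n - p) is unbiased
```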
Suppose now the ordinary least squares model is fitted to the warped target values zᵢ = f(yᵢ; ψ).
Ordinary least squares gives us a point prediction and error distribution for the latent space, but we need to invert the warping to get a prediction for the target space.
Let ẑ represent the latent prediction for an out-of-sample feature vector x′. If s² is the estimated latent noise variance, then the probability density of a target value y is
f′(y; ψ) · N(f(y; ψ); ẑ, s²),
where N(·; ẑ, s²) denotes the normal density with mean ẑ and variance s², and the expected target value is
E[y] = ∫ y · f′(y; ψ) · N(f(y; ψ); ẑ, s²) dy.
After making the substitution u = f(y), the expected value can be rewritten as
E[y] = ∫ f⁻¹(u; ψ) · N(u; ẑ, s²) du.
The inverse of f can be computed using Newton’s method to find the root of f(y) − u, and the integral can be efficiently evaluated with Gauss-Hermite quadrature .
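Here’s a sketch of that prediction step under an assumed single-tanh-step warp, with illustrative helper names: Newton’s method inverts f, and Gauss-Hermite quadrature evaluates the expectation.

```python
import numpy as np

def warp(y, psi):
    """Assumed warp f(y; psi) = y + a*tanh(b*(y + c))."""
    a, b, c = psi
    return y + a * np.tanh(b * (y + c))

def warp_prime(y, psi):
    a, b, c = psi
    return 1.0 + a * b / np.cosh(b * (y + c)) ** 2

def warp_inverse(u, psi, iterations=50):
    """Invert f with Newton's method: find y such that f(y; psi) = u.
    (For strongly warped f, a safeguarded root finder may be preferable.)"""
    y = np.asarray(u, dtype=float).copy()  # f is close to the identity, so u is a fine start
    for _ in range(iterations):
        y -= (warp(y, psi) - u) / warp_prime(y, psi)
    return y

def expected_target(z_hat, s, psi, degree=50):
    """E[y] = integral of f^{-1}(u) * N(u; z_hat, s^2) du via Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(degree)
    u = z_hat + np.sqrt(2.0) * s * t          # change of variables for the Gaussian weight
    return np.sum(w * warp_inverse(u, psi)) / np.sqrt(np.pi)

psi = (1.0, 2.0, 0.0)            # illustrative warping parameters
print(expected_target(z_hat=0.5, s=0.3, psi=psi))
```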
What are some effective functions for warping?
Let’s turn our attention to the warping function f(y; ψ) and how to parameterize it. We’d like the parameterization to allow for a wide range of different functions, but we also need to ensure that it only permits monotonically increasing surjective warping functions.
Observe that the fit is invariant under rescaling of the warping function: c·f(y; ψ) leads to the same results as f(y; ψ). Set θ′ so that g(x; θ′) = c·g(x; θ). Then the log-likelihood proxy for c·f(y; ψ) is
L(ψ, θ′) = n·log c + Σᵢ log f′(yᵢ; ψ) − (n/2) log(c²·RSS(θ, ψ)) = L(ψ, θ).
What’s important is how the warping function changes the relative spacing between target values.
One effective family of functions for warping is
f(t; ψ) = t + Σⱼ aⱼ·tanh(bⱼ(t + cⱼ)),   aⱼ, bⱼ ≥ 0,
with ψ = (a₁, b₁, c₁, a₂, b₂, c₂, …). Each tanh step allows for a localized change to the warping function’s slope. The t term ensures that the warping function is monotonically increasing and surjective and reverts back to the identity when t is far from any step. And because of the invariance to scaling, it’s unnecessary to add a scaling coefficient to t.
We’ll make one additional adjustment so that the warping function zeros the mean of the warped training targets. Put
f̄(t; ψ) = f(t; ψ) − (1/n) Σᵢ f(yᵢ; ψ).
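Here’s a sketch of this family and its derivative. The parameter layout and the use of exponentiation to enforce the positivity constraints are illustrative choices, not taken from any particular implementation.

```python
import numpy as np

def _raw_warp(t, a, b, c):
    """t + sum_j a_j * tanh(b_j * (t + c_j)), evaluated elementwise on a 1-D array."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return t + np.sum(a * np.tanh(b * (t[:, None] + c)), axis=1)

def tanh_warp(t, psi, y_train=None):
    """f(t; psi) with psi laid out as (log a_1, log b_1, c_1, log a_2, ...).
    Exponentiating the first two entries of each step keeps a_j, b_j > 0, which
    makes f monotonically increasing and surjective.  If y_train is given, the
    mean of the warped training targets is subtracted ("zeroing the mean")."""
    a, b, c = np.exp(psi[0::3]), np.exp(psi[1::3]), psi[2::3]
    out = _raw_warp(t, a, b, c)
    if y_train is not None:
        out = out - np.mean(_raw_warp(y_train, a, b, c))
    return out

def tanh_warp_prime(t, psi):
    """f'(t; psi) = 1 + sum_j a_j * b_j / cosh(b_j * (t + c_j))^2, always >= 1."""
    a, b, c = np.exp(psi[0::3]), np.exp(psi[1::3]), psi[2::3]
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return 1.0 + np.sum(a * b / np.cosh(b * (t[:, None] + c)) ** 2, axis=1)

psi = np.array([0.0, 0.0, -1.0,    # step 1: a=1, b=1, centered at t=1
                -0.7, 0.7, 2.0])   # step 2: a~0.5, b~2, centered at t=-2
y = np.linspace(-3, 3, 7)
print(tanh_warp(y, psi, y_train=y))
print(tanh_warp_prime(y, psi))
```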
An Example Problem
The Communities and Crime Dataset⁵ provides crime statistics for different localities across the United States. As a regression problem, the task is to predict the violent crime rate from different socio-economic indicators. We’ll fit a warped linear regression model to the dataset and compare how it performs to an ordinary least squares model.
Let’s look at the warping function fit to maximize the log-likelihood on the training data.
Let σ denote the estimated noise standard deviation in the latent space. To visualize how this function changes an error distribution, we’ll plot the range that maps back from ±σ in the latent space,
f⁻¹(f(y; ψ) − σ; ψ) to f⁻¹(f(y; ψ) + σ; ψ),
across the target values.
Warping makes a prediction’s error range smaller at lower target values⁶.
To see if warping leads to better results, let’s compare the performance of a warped linear regression model (WLR) to an ordinary least squares model (OLS) on a ten-fold cross-validation of the communities dataset. We use Mean Log-Likelihood (MLL) as the error measurement. MLL averages the log-likelihood of each out-of-sample prediction in the cross-validation⁷.
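For reference, here’s a sketch of how the MLL of the OLS baseline could be computed in such a cross-validation; the dataset loading, preprocessing, and the warped-model counterpart are omitted, and the helper name is illustrative.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def ols_mean_log_likelihood(X, y, folds=10, seed=0):
    """Average out-of-sample log-likelihood of an OLS model with a normal
    error distribution whose variance is estimated as RSS / (n - p)."""
    log_likelihoods = []
    for train, test in KFold(folds, shuffle=True, random_state=seed).split(X):
        model = LinearRegression().fit(X[train], y[train])
        residuals = y[train] - model.predict(X[train])
        s2 = np.sum(residuals ** 2) / (len(train) - X.shape[1] - 1)  # -1 for the intercept
        preds = model.predict(X[test])
        log_likelihoods.append(stats.norm.logpdf(y[test], loc=preds, scale=np.sqrt(s2)))
    return np.mean(np.concatenate(log_likelihoods))

# Usage (assuming X, y are numpy arrays holding the communities features and
# violent-crime targets):
# print(ols_mean_log_likelihood(X, y))
```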
The results show warped linear regression performing substantially better. Drilling down on a few randomly chosen predictions and their error distributions helps explain why.
The target’s value range is naturally restricted at zero, and warping reshapes the probability density function to taper off so that more of the probability mass falls on valid target values.
Summary
It can be tempting to use a normal distribution to model errors. It makes the math easier, and the central limit theorem tells us normality arises naturally when errors break down into sums of independent, identically distributed random variables.
But many regression problems don’t fit into such a framework and error distributions can be far from normal.
When faced with non-normality in the error distribution, one option is to transform the target space. With the right function f, it may be possible to achieve normality when we replace the original target values y with f(y). Specifics of the problem can sometimes lead to a natural choice for f. At other times, we might approach the problem with a toolbox of fixed transformations and hope that one unlocks normality. But that can be an ad hoc process.
Warping turns the transformation step into a maximum likelihood problem. Instead of applying fixed transformations, warping uses parameterized functions that can approximate arbitrary transformations and fits the functions to the problem with the help of an optimizer.
Through the transformation function, warping can capture aspects of non-normality in error distributions like skewing and fat tails. For many problems, it leads to better performance on out-of-sample predictions and avoids the ad hocery of working with fixed transformations.