What to Do When Your Model Has a Non-Normal Error Distribution

Category: IT Technology · Published: 4 years ago

OPTIMIZATION AND MACHINE LEARNING

How to use warping to fit arbitrary error distributions

Mar 13 · 13 min read

Photo by Neil Rosenstech on Unsplash

One of the most important things a model can tell us is how certain it is about a prediction. An answer to this question can come in the form of an error distribution. An error distribution is a probability distribution about a point prediction telling us how likely each error delta is.

The error distribution can be every bit as important as the point prediction.

Suppose you’re an investor considering two different opportunities (A and B) and using a model to predict the one-year returns (as a percentage of the amount invested). The model predicts A and B to have the same expected 1-year return of 10% but shows these error distributions

[Figure: error distributions of the predicted one-year returns for opportunities A and B]

Even though both opportunities have the same expected return, the error distributions show how different they are. B is tightly distributed about its expected value with little risk of losing money, whereas A is more like a lottery. There’s a small probability of a high payout (~500% return), but the majority of the time we’ll lose everything (~-100% return).

A point prediction tells us nothing about where target values are likely to be distributed. If it’s important to know how far off a prediction can be, or if target values can be clustered about fat tails, then an accurate error distribution becomes essential.

An easy way to get the error distribution wrong is to try to force it into a form it doesn’t take. This frequently happens when we reach for the convenient, but often misapplied, normal distribution.

The normal distribution is popular for good reason. In addition to making math easier, the central limit theorem tells us the normal distribution can be a natural choice for many problems.

How can a normal distribution come about naturally?

Let X denote a feature matrix and b denote a vector of regressors. Suppose target values are generated by the equation

$$ y = Xb + e $$

where

$$ e_i = \sum_{j=1}^{m} E_{ij} $$

The central limit theorem says that if the E’s are independent, identically distributed random variables with finite variance, then the sum will approach a normal distribution as m increases.

Even when E is wildly non-normal, e will be close to normal if the summation contains enough terms.

Let’s look at a concrete example. Set b = (-2, 3). Let the entries of X be generated independently from the uniform distribution on [-1, 1]. We’ll generate the E’s from this decidedly non-normal distribution:

[Image: the distribution used to generate the E’s]

We normalize the error distribution for e to have unit variance and allow the number of terms m to vary. Here are histograms of the errors (in orange) from least squares models taken from runs of a simulation for different values of m and overlaid with the expected histogram of the errors if they were normally distributed (in blue)¹.

[Figure: histograms of the least squares errors (orange) for increasing values of m, overlaid with the expected normal histogram (blue)]

With greater values of m, the error histogram gets increasingly close to that of the normal distribution.
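The convergence described above can be reproduced with a few lines of NumPy. This is only a sketch and not the simulation behind the histograms: it assumes the E’s are centered exponential variables (a stand-in for the non-normal distribution used in the article) and it looks at the summed errors directly rather than at least squares residuals. Sample skewness, which is 0 for a normal distribution, serves as a rough measure of non-normality.

```python
import numpy as np

def summed_noise(m, n_samples, rng):
    """Draw n_samples errors, each the sum of m iid centered exponentials,
    rescaled to unit variance."""
    # Exponential(1) has mean 1 and variance 1, so subtracting 1 centers it.
    E = rng.exponential(scale=1.0, size=(n_samples, m)) - 1.0
    return E.sum(axis=1) / np.sqrt(m)

def skewness(x):
    """Sample skewness (third standardized moment)."""
    x = x - x.mean()
    return np.mean(x**3) / np.mean(x**2) ** 1.5

rng = np.random.default_rng(0)
skews = {m: skewness(summed_noise(m, 50_000, rng)) for m in (1, 4, 16, 64)}
for m, s in skews.items():
    print(f"m={m:3d}  skewness={s:.3f}")
```

For exponential terms the theoretical skewness of the sum is 2/√m, so it halves each time m is quadrupled.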

When there’s reason to think that error terms break down into sums of independent, identically distributed factors like this, the normal distribution is a good choice. But in the general case, we have no reason to assume it. And indeed, many error distributions are not normal, exhibiting skew and fat tails.

What should we do when we have non-normality in an error distribution?

This is where warping helps us². It uses the normal distribution as a building block but gives us knobs to locally adjust the distribution to better fit the errors from the data.

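To see how warping works, observe that if f(y) is a monotonically increasing surjective function and p(z) is a probability density function, then p(f(y))f′(y) forms a new probability density function. It is non-negative,

$$ p(f(y))\, f'(y) \ge 0, $$

because f′(y) ≥ 0; and after applying the substitution u = f(y), we see that

$$ \int_{-\infty}^{\infty} p(f(y))\, f'(y)\, dy = \int_{-\infty}^{\infty} p(u)\, du = 1 $$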

Let’s look at an example to see how f can reshape a distribution. Suppose p(z) is the standard normal distribution N(0, 1) and f(y) is defined by

$$ f(y) = \begin{cases} y, & y \le 0 \\ cy, & y \ge 1 \end{cases} $$

where c > 0; and between [0, 1], f is a spline that smoothly transitions between y and cy. Here’s what f looks like for a few different values of c:

[Figure: f for a few different values of c]

and here are what the resulting warped probability distributions look like³:

[Figure: the resulting warped probability distributions]

When c = 2, area is redistributed from the standard normal distribution so that the probability density function (PDF) peaks and then quickly falls off so as to have a thinner right tail. When c = 0.5, the opposite happens: the PDF falls off quickly and then slows its rate of decline so as to have a fatter right tail.
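A quick numerical check makes this concrete. The sketch below builds the warped density p(f(y))f′(y) for a standard normal p and confirms that it still integrates to 1. The article does not say which spline joins the two linear pieces, so the cubic Hermite blend used here is an assumption chosen only because it matches values and slopes at 0 and 1.

```python
import numpy as np

def f(y, c):
    """Warp: identity for y <= 0, c*y for y >= 1, cubic Hermite blend between.
    The blend is an assumed spline; it stays increasing for c > 1/4."""
    y = np.asarray(y, dtype=float)
    t = np.clip(y, 0.0, 1.0)
    # Cubic with value 0 and slope 1 at t=0, value c and slope c at t=1.
    blend = t**3 - 2 * t**2 + t + c * (2 * t**2 - t**3)
    return np.where(y <= 0, y, np.where(y >= 1, c * y, blend))

def f_prime(y, c):
    """Derivative of the warp with respect to y."""
    y = np.asarray(y, dtype=float)
    t = np.clip(y, 0.0, 1.0)
    dblend = 1 + (1 - c) * (3 * t**2 - 4 * t)
    return np.where(y <= 0, 1.0, np.where(y >= 1, c, dblend))

def warped_pdf(y, c):
    """p(f(y)) * f'(y) with p the standard normal density."""
    z = f(y, c)
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi) * f_prime(y, c)

# Riemann sum over a range wide enough that the tails are negligible.
y = np.linspace(-10, 10, 200_001)
dy = y[1] - y[0]
areas = {c: float(np.sum(warped_pdf(y, c)) * dy) for c in (0.5, 1.0, 2.0)}
print(areas)
```

With c = 1 the warp is the identity and the result is the standard normal density itself.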

Now, imagine f is parameterized by a vector ψ that allows us to make arbitrary localized adjustments to the rate of increase. (More on how to parameterize f later). Then with suitable ψ , f can fit a wide range of different distributions. If we can find a way to properly adjust ψ , then this will give us a powerful tool to fit error distributions.

How to adjust warping parameters?

A better fit error distribution makes the errors on the training data more likely. It follows that we can find warping parameters by maximizing likelihood on the training data.

First, let’s look at how maximizing likelihood works without warping.

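Let θ denote the parameter vector for a given regression model. Let g(x; θ) represent the prediction of the model for feature vector x. If we use a normal distribution with standard deviation σ to model the error distribution of predictions, then the likelihood of the training data (x_1, y_1), …, (x_n, y_n) is

$$ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - g(x_i; \theta))^2}{2\sigma^2}\right) $$

and the log-likelihood is

$$ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - g(x_i; \theta))^2 $$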

Put

$$ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - g(x_i; \theta))^2 $$

(RSS stands for residual sum of squares.)

For θ fixed, σ maximizes likelihood when

$$ \sigma^2 = \frac{\mathrm{RSS}}{n} $$

More generally, if σ² = c RSS (c > 0), then the log-likelihood simplifies to

$$ -\frac{n}{2}\log(2\pi c\,\mathrm{RSS}) - \frac{1}{2c} $$

And we see that likelihood is maximized when θ minimizes RSS.
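Both claims are easy to verify numerically. The sketch below evaluates the normal log-likelihood for an arbitrary residual vector, checks that σ² = RSS/n beats nearby values of σ², and checks the simplified form for σ² = c RSS. The residuals are random draws used purely for illustration, and the function name is hypothetical.

```python
import numpy as np

def normal_log_likelihood(residuals, sigma2):
    """Log-likelihood of residuals under N(0, sigma2) errors."""
    n = residuals.size
    rss = np.sum(residuals**2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

rng = np.random.default_rng(1)
residuals = rng.standard_t(df=3, size=200)  # illustrative residuals only
n = residuals.size
rss = np.sum(residuals**2)

# sigma^2 = RSS / n should give the largest log-likelihood.
sigma2_hat = rss / n
best = normal_log_likelihood(residuals, sigma2_hat)
others = [normal_log_likelihood(residuals, s * sigma2_hat) for s in (0.5, 0.9, 1.1, 2.0)]
print(best, others)

# With sigma^2 = c * RSS the log-likelihood depends on the data only through RSS.
c = 0.3
simplified = -0.5 * n * np.log(2 * np.pi * c * rss) - 1 / (2 * c)
print(normal_log_likelihood(residuals, c * rss), simplified)
```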

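Now, suppose we warp the target space with the monotonic function f parameterized by ψ. Let f(y; ψ) denote a warped target value. Then the likelihood with the warped error distribution is

$$ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(f(y_i; \psi) - g(x_i; \theta))^2}{2\sigma^2}\right) f'(y_i; \psi) $$

and the log-likelihood becomes

$$ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (f(y_i; \psi) - g(x_i; \theta))^2 + \sum_{i=1}^{n} \log f'(y_i; \psi) $$

Or with

$$ \mathrm{RSS} = \sum_{i=1}^{n} (f(y_i; \psi) - g(x_i; \theta))^2 $$

and σ² = c RSS

$$ -\frac{n}{2}\log(2\pi c\,\mathrm{RSS}) - \frac{1}{2c} + \sum_{i=1}^{n} \log f'(y_i; \psi) $$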

To fit the error distribution, we’ll use an optimizer to find the parameters (θ, ψ) that maximize this likelihood.

For an optimizer to work, it requires a local approximation to the objective that it can use to iteratively improve on parameters. To build such an approximation, we’ll need to compute the gradient of the log-likelihood with respect to the parameter vector.

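Put

$$ L(\psi, \theta) = -\frac{n}{2}\log(\mathrm{RSS}) + \sum_{i=1}^{n} \log f'(y_i; \psi) $$

where RSS is the residual sum of squares of the warped targets. We can use L as a proxy for the log-likelihood since it differs only by a constant.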

Warping is a general process that can be applied to any base regression model, but we’ll focus on the simplest base model, linear regression.

How to warp a linear regression model?

With linear regression, we can derive a closed form for θ. Let Q and R be the matrices of the QR-factorization of the feature matrix X

$$ X = QR $$

where Q is orthogonal and R is rectangular triangular. Put

$$ z = (f(y_1; \psi), \ldots, f(y_n; \psi))^T $$

and let b̂ denote the vector that minimizes RSS for the warped targets z

$$ \hat{b} = \operatorname{argmin}_b \lVert z - Xb \rVert^2 $$

Put

$$ v = Q^T z $$

Then

$$ \mathrm{RSS} = \lVert z - X\hat{b} \rVert^2 = \lVert Q^T z - R\hat{b} \rVert^2 = \lVert v - R\hat{b} \rVert^2 $$

If X has m linearly independent columns, then the first m rows of the rectangular triangular matrix R have non-zero entries on the diagonal and the remaining rows are 0. It follows that

$$ (R\hat{b})_i = v_i $$

for i ≤ m and

$$ (R\hat{b})_i = 0 $$

for i > m. Therefore,

$$ \mathrm{RSS} = \sum_{i=m+1}^{n} v_i^2 $$

Let P be the n x n diagonal matrix with

$$ P_{ii} = \begin{cases} 0, & i \le m \\ 1, & i > m \end{cases} $$

Set

$$ A = Q P Q^T $$

Then

$$ \mathrm{RSS} = z^T A z $$

Substituting these equations into the log-likelihood proxy, we get

$$ L(\psi) = -\frac{n}{2}\log(z^T A z) + \sum_{i=1}^{n} \log f'(y_i; \psi) $$

And differentiating with respect to a warping parameter ψ_j gives us

$$ \frac{\partial L}{\partial \psi_j} = -n\,\frac{z^T A\,(\partial z / \partial \psi_j)}{z^T A z} + \sum_{i=1}^{n} \frac{1}{f'(y_i; \psi)} \frac{\partial f'(y_i; \psi)}{\partial \psi_j} $$

Using these derivatives, an optimizer can climb to warping parameters ψ that maximize the likelihood of the training data.
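The pieces of this section can be sketched in NumPy. The code below is a minimal illustration rather than the author's implementation: it assumes a warp with a single tanh step, f(y) = y + a·tanh(b(y + c)), builds the projector from a complete QR factorization so that the quadratic form in the warped targets gives the least squares RSS, and compares an analytic derivative of the log-likelihood proxy with respect to a against a finite difference. All function names are hypothetical.

```python
import numpy as np

def warp(y, a, b, c):
    """Assumed single-step warp f(y) = y + a*tanh(b*(y + c)), with a, b >= 0."""
    return y + a * np.tanh(b * (y + c))

def warp_prime(y, a, b, c):
    """Derivative of the warp with respect to y."""
    return 1.0 + a * b / np.cosh(b * (y + c)) ** 2

def residual_projector(X):
    """Matrix A such that z @ A @ z is the least squares RSS for targets z."""
    n, m = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")  # Q is n x n and orthogonal
    Q_rest = Q[:, m:]                          # columns orthogonal to range(X)
    return Q_rest @ Q_rest.T

def proxy(y, A, a, b, c):
    """Log-likelihood proxy: -(n/2) log(z^T A z) + sum_i log f'(y_i)."""
    z = warp(y, a, b, c)
    return -0.5 * y.size * np.log(z @ A @ z) + np.sum(np.log(warp_prime(y, a, b, c)))

def proxy_grad_a(y, A, a, b, c):
    """Analytic derivative of the proxy with respect to the parameter a."""
    z = warp(y, a, b, c)
    dz = np.tanh(b * (y + c))                 # dz/da
    dfp = b / np.cosh(b * (y + c)) ** 2       # d f'/da
    return -y.size * (z @ A @ dz) / (z @ A @ z) + np.sum(dfp / warp_prime(y, a, b, c))

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, size=(50, 2))])
y = X @ np.array([0.5, -2.0, 3.0]) + rng.normal(size=50)
A = residual_projector(X)

a, b, c = 0.7, 1.3, 0.2
z = warp(y, a, b, c)
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
rss_direct = np.sum((z - X @ coef) ** 2)
h = 1e-6
numeric = (proxy(y, A, a + h, b, c) - proxy(y, A, a - h, b, c)) / (2 * h)
print(z @ A @ z, rss_direct)
print(proxy_grad_a(y, A, a, b, c), numeric)
```

Forming the full n x n matrix is convenient for a small check like this; for large n it would be cheaper to work with the product of Q transposed and z directly.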

How to make predictions with a warped linear regression model?

Now that we’ve found warping parameters, we need to make predictions.

Consider how this works in a standard ordinary least squares model without warping. Suppose data is generated from the model

$$ y = x^T b + \varepsilon $$

where ε is distributed as N(0, σ²). Let X and y denote the training data. The regressors that minimize the RSS of the training data are

$$ \hat{b} = (X^T X)^{-1} X^T y $$

If x′ and y′ denote an out-of-sample feature vector and target value

$$ y' = x'^T b + \varepsilon' $$

then the error of the out-of-sample prediction is

$$ e' = y' - x'^T \hat{b} = \varepsilon' - x'^T (X^T X)^{-1} X^T \varepsilon $$

where ε on the right-hand side is the vector of training noise terms. Because ε and ε′ are normally distributed, it follows that e′ is normally distributed and the variance is⁴

$$ \operatorname{Var}(e') = \sigma^2 \left(1 + x'^T (X^T X)^{-1} x'\right) $$

We rarely know the noise variance σ², but we can use this equation to obtain an unbiased estimate for it

$$ \hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - p} $$

where p is the number of regressors.
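As a small worked example of these formulas, the sketch below fits ordinary least squares, forms the unbiased noise estimate RSS/(n − p), and evaluates the out-of-sample error variance σ²(1 + x′ᵀ(XᵀX)⁻¹x′). The function names are hypothetical, and the data is an intercept-only toy problem chosen so the numbers can be checked by hand.

```python
import numpy as np

def ols_fit(X, y):
    """Return least squares regressors and the unbiased noise variance estimate."""
    n, p = X.shape
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b_hat) ** 2)
    return b_hat, rss / (n - p)

def prediction_variance(X, x_new, sigma2):
    """Variance of the out-of-sample error: sigma2 * (1 + x'^T (X^T X)^{-1} x')."""
    return sigma2 * (1.0 + x_new @ np.linalg.solve(X.T @ X, x_new))

# Intercept-only model: the fit is the mean of y.
X = np.ones((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])
b_hat, sigma2_hat = ols_fit(X, y)
var_new = prediction_variance(X, np.array([1.0]), sigma2_hat)
print(b_hat, sigma2_hat, var_new)
```

Here the fitted value is 2.5, RSS is 5, the noise estimate is 5/3, and the prediction variance is (5/3)(1 + 1/4) = 25/12.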

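Suppose now the ordinary least squares model is fitted to the warped target values

$$ z_i = f(y_i; \psi) $$

Ordinary least squares gives us a point prediction and error distribution for the latent space, but we need to invert the warping to get a prediction for the target space.

Let ẑ′ represent the latent prediction for an out-of-sample feature vector x′. If σ̂² is the estimated latent noise variance, then the probability density of a target value y is

$$ p(y \mid x') = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\left(-\frac{(f(y; \psi) - \hat{z}')^2}{2\hat{\sigma}^2}\right) f'(y; \psi) $$

and the expected target value is

$$ E[y \mid x'] = \int_{-\infty}^{\infty} y\, p(y \mid x')\, dy $$

After making the substitution u = f(y), the expected value can be rewritten as

$$ E[y \mid x'] = \int_{-\infty}^{\infty} f^{-1}(u)\, \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\left(-\frac{(u - \hat{z}')^2}{2\hat{\sigma}^2}\right) du $$

The inverse of f can be computed using Newton’s method to find the root of f(y) − u, and the integral can be efficiently evaluated with Gauss-Hermite quadrature.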
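The two numerical steps named above can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the warp is a single tanh step f(y) = y + a·tanh(b(y + c)), the Newton iteration uses a fixed number of steps with no safeguards, and the function names are hypothetical. The quadrature nodes and weights come from NumPy's `hermgauss`, which targets integrals weighted by exp(−x²), hence the √2·σ rescaling and the division by √π.

```python
import numpy as np

def warp(y, a, b, c):
    """Assumed single-step warp f(y) = y + a*tanh(b*(y + c)), with a, b >= 0."""
    return y + a * np.tanh(b * (y + c))

def warp_prime(y, a, b, c):
    """Derivative of the warp with respect to y."""
    return 1.0 + a * b / np.cosh(b * (y + c)) ** 2

def warp_inverse(u, a, b, c, iterations=60):
    """Solve f(y) - u = 0 elementwise with Newton's method.
    Plain Newton steps are guaranteed to contract here when a*b < 1;
    steeper warps would need a safeguarded root finder."""
    u = np.asarray(u, dtype=float)
    y = u.copy()  # f is the identity plus a bounded term, so start at u
    for _ in range(iterations):
        y = y - (warp(y, a, b, c) - u) / warp_prime(y, a, b, c)
    return y

def expected_target(z_hat, sigma, a, b, c, degree=40):
    """E[f^{-1}(U)] for U ~ N(z_hat, sigma^2) by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(degree)
    u = z_hat + np.sqrt(2.0) * sigma * nodes
    return np.sum(weights * warp_inverse(u, a, b, c)) / np.sqrt(np.pi)

a, b, c = 0.5, 1.0, 0.0
u = np.linspace(-3, 3, 13)
round_trip = warp(warp_inverse(u, a, b, c), a, b, c)
print(np.max(np.abs(round_trip - u)))
print(expected_target(1.2, 0.7, a, b, c))
```

When a = 0 the warp is the identity and the expected target equals the latent prediction, which gives a simple sanity check.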

What are some effective functions for warping?

Let’s turn our attention to the warping function f(y; ψ) and how to parameterize it. We’d like for the parameterization to allow for a wide range of different functions, but we also need to ensure that it only permits monotonically increasing surjective warping functions.

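Observe that the warping function is invariant under rescaling: c f(y; ψ) with c > 0 leads to the same results as f(y; ψ). Set θ′ so that g(x; θ′) = c g(x; θ). Then the log-likelihood proxy L(ψ, θ′) for c f(y; ψ) is

$$ -\frac{n}{2}\log(c^2\,\mathrm{RSS}) + \sum_{i=1}^{n} \log\left(c\, f'(y_i; \psi)\right) = -\frac{n}{2}\log(\mathrm{RSS}) + \sum_{i=1}^{n} \log f'(y_i; \psi) = L(\psi, \theta) $$

where RSS is the residual sum of squares for the unscaled f(y; ψ) and g(x; θ).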

What’s important is how the warping function changes the relative spacing between target values.
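The invariance can be confirmed numerically for a linear base model. In the sketch below the warp y + 0.8·tanh(1.5y) is an arbitrary monotone stand-in, the data is random, and `proxy` is a hypothetical helper that picks the regression parameters by least squares before evaluating the log-likelihood proxy.

```python
import numpy as np

def proxy(X, z, z_prime):
    """-(n/2) log RSS + sum_i log f'(y_i), with the regressors fit by least squares.
    z holds the warped targets and z_prime the warp derivatives at the targets."""
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = np.sum((z - X @ coef) ** 2)
    return -0.5 * z.size * np.log(rss) + np.sum(np.log(z_prime))

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = rng.normal(size=40)

# Arbitrary monotone warp and its derivative, used only for illustration.
z = y + 0.8 * np.tanh(1.5 * y)
z_prime = 1.0 + 0.8 * 1.5 / np.cosh(1.5 * y) ** 2

base = proxy(X, z, z_prime)
scaled = {c: proxy(X, c * z, c * z_prime) for c in (0.1, 3.0, 50.0)}
print(base, scaled)
```

Scaling the warp by c multiplies RSS by c² and every derivative by c, and the two effects cancel in the proxy.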

One effective family of functions for warping is

$$ f(t; \psi) = t + \sum_{i=1}^{k} a_i \tanh\left(b_i (t + c_i)\right), \qquad a_i, b_i \ge 0 $$

where ψ = (a, b, c). Each tanh step allows for a localized change to the warping function’s slope. The t term ensures that the warping function is monotonic and surjective and behaves like the identity (up to a constant offset) when t is far from any step. And because of the invariance to scaling, it’s unnecessary to add a scaling coefficient to t.
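A small sketch of such a warp, written as the sum of t and several tanh steps with non-negative heights a_i and steepnesses b_i, shows these properties directly. The parameter names a, b, c and the values chosen are assumptions for illustration.

```python
import numpy as np

def tanh_warp(t, a, b, c):
    """f(t) = t + sum_i a_i * tanh(b_i * (t + c_i)); assumes a_i, b_i >= 0."""
    t = np.asarray(t, dtype=float)
    steps = a * np.tanh(b * (t[..., None] + c))  # one column per tanh step
    return t + steps.sum(axis=-1)

def tanh_warp_prime(t, a, b, c):
    """Slope of the warp: 1 plus a localized bump from each step."""
    t = np.asarray(t, dtype=float)
    bumps = a * b / np.cosh(b * (t[..., None] + c)) ** 2
    return 1.0 + bumps.sum(axis=-1)

# Two steps: a sharp one centered at t = -1 and a broad one centered at t = 2.
a = np.array([0.5, 2.0])
b = np.array([4.0, 1.0])
c = np.array([1.0, -2.0])

t = np.linspace(-8, 8, 801)
values = tanh_warp(t, a, b, c)
slopes = tanh_warp_prime(t, a, b, c)
h = 1e-5
numeric_slopes = (tanh_warp(t + h, a, b, c) - tanh_warp(t - h, a, b, c)) / (2 * h)
print(slopes.min(), slopes.max(), slopes[0])
```

Because every bump is non-negative, the slope never drops below 1, which is what keeps the warp increasing and surjective.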

We’ll make one additional adjustment so that the warping function zeros the mean. Put

[Equation image: the mean-zeroed warping function]

An Example Problem

The Communities and Crime Dataset⁵ provides crime statistics for different localities across the United States. As a regression problem, the task is to predict the violent crime rate from different socio-economic indicators. We’ll fit a warped linear regression model to the dataset and compare how it performs to an ordinary least squares model.

Let’s look at the warping function fit to maximize the log-likelihood on the training data.

[Figure: the fitted warping function]

Let σ denote the estimated noise standard deviation in the latent space. To visualize how this function changes an error distribution, we’ll plot the range

across the target values

[Figure: error range plotted across the target values]

Warping makes a prediction’s error range smaller at lower target values⁶.

To see if warping leads to better results, let’s compare the performance of a warped linear regression model (WLR) to an ordinary least squares model (OLS) on a ten-fold cross-validation of the communities dataset. We use Mean Log-Likelihood (MLL) as the error measurement. MLL averages the log-likelihood of each out-of-sample prediction in the cross-validation⁷.

[Image: MLL of WLR and OLS on the ten-fold cross-validation]

The results show warped linear regression performing substantially better. Drilling down on a few randomly chosen predictions and their error distributions helps explain why.

[Figure: error distributions for a few randomly chosen predictions]

The value range is naturally restricted at zero and the warping reshapes the probability density function to taper off so that there’s more probability mass for valid target values.

Summary

It can be tempting to use a normal distribution to model errors. It makes math easier and the central limit theorem tells us normality arises naturally when errors break down into sums over independent, identically distributed random variables.

But many regression problems don’t fit into such a framework and error distributions can be far from normal.

When faced with non-normality in the error distribution, one option is to transform the target space. With the right function f, it may be possible to achieve normality when we replace the original target values y with f(y). Specifics of the problem can sometimes lead to a natural choice for f. At other times, we might approach the problem with a toolbox of fixed transformations and hope that one unlocks normality. But that can be an ad-hoc process.

Warping turns the transformation step into a maximum likelihood problem. Instead of applying fixed transformations, warping uses parameterized functions that can approximate arbitrary transformations and fits the functions to the problem with the help of an optimizer.

Through the transformation function, warping can capture aspects of non-normality in error distributions like skewing and fat tails. For many problems, it leads to better performance on out-of-sample predictions and avoids the ad hocery of working with fixed transformations.

