Abusing Linear Regression to Make a Point


A bunch of people have been sending me links to a particularly sloppy article that (mis)uses linear regression to draw an incorrect conclusion from some data. So I guess I've got to go back to good old linear regression, and talk about it a bit.

Let’s start with the basics. What is linear regression?

If you have a collection of data – typically data with one independent variable, and one dependent variable (that is, the first variable can vary any way it wants; changing it will change the second variable), then you’re probably interested in how the dependent variable relates to the independent. If you have reason to believe that they should have a linear relationship, then you’d like to know just what that linear relationship is.

If your data were perfect, then you’d just need to plot all of the data points on a graph, with the independent variable on the X axis, and the dependent on the Y, and then your graph would be a line, and you could get its slope and Y intercept, and thus completely capture the relationship.

But data is never perfect. There are a lot of reasons for that, but no real set of collected data is ever perfect: no matter how perfect the underlying linear relationship is, real measured data will always show some scatter. And that means that you can draw a lot of possible lines through the collected data. Which one of them represents the best fit?

Since that’s pretty abstract, I’m going to talk a bit about an example – the very example that was used to ignite my interest in math!

Back in 1974 or so, when I was a little kid in second grade, my father was working for RCA, as a physicist involved in manufacturing electronics for satellite systems. One of the important requirements for the products they were manufacturing was that they be radiation hard – meaning that they could be exposed to quite a bit of radiation before they would be damaged enough to stop working.

Their customers – NASA, JPL, and various groups from the U.S. military – had very strong requirements. They had to show, for a manufacturing setup of a particular component, what the failure profile was.

The primary failure mode of these chips they were making was circuit trace failure. If a sufficiently energetic gamma ray hit one of the circuit traces, it was possible that the trace would burn out – breaking the circuit, and causing the chip to fail.

The test setup that they used had a gamma ray emitter. So they'd make a manufacturing run to produce a batch of chips from the setup. Then they'd take those chips, expose them to increasing doses of radiation from the gamma emitter, and record when they failed.

For trace failure, the probability of failure is linear in the size of the radiation dose that the chip is exposed to. So to satisfy the customer, they had to show them what the slope of the failure curve was. “Radiation hard” was defined as being able to sustain exposure to some dose of radiation with a specified probability of failure.

So, my dad had done a batch of tests, and he had a ton of little paper slips that described the test results, and he needed to compute the slope of that line – which would give the probability of failure as a multiple of the radiation dose.

I walked into the dining room, where he was set up doing this, and asked what he was doing. So he explained it to me. A lot like I just explained above – except that my dad was a much better teacher than me. I couldn’t explain this to a second or third grader the way that he did!

Anyway… The method that we use to compute the best line is called least squares. The intuition behind it is that you're trying to find the line where the average distance of all of the data points from that line is the smallest. But a simple average doesn't work well – because some of the data points are above the line, and some are below. Just because one point is, say, above a possible fit by 100, and another is below by 100 doesn't mean that the two should cancel. So you take the distance between each data point and the line, and you square it – making them all positive. Then you find the line where that total is the smallest – and that's the best fit.
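In symbols: if a candidate line is \( y = mx + b \), then least squares picks the \( m \) and \( b \) that minimize the sum of the squared errors:

\[ \mathrm{SSE}(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2 \]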

So let’s look at a real-ish example.

For example, here's a graph of data points that I generated semi-randomly. The distribution of the points isn't really what you'd get from real observations, but it's good enough for demonstration.

[Figure: scatter plot of the semi-random data points, with the least-squares fit line]

So how do we actually compute that fit? First we compute the means of \( x \) and \( y \), which we'll call \( \bar{x} \) and \( \bar{y} \). Then using those, we compute the slope as:

\[ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Then for the y intercept: \( b = \bar{y} - m\bar{x} \).
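Translated directly into Python, the computation looks something like this (a minimal sketch of the formulas above, not the actual script I used):

```python
def least_squares_fit(xs, ys):
    """Fit y = m*x + b to paired data by least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y, divided by the variance of x.
    m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    # Intercept: the best-fit line always passes through the point of means.
    b = y_mean - m * x_mean
    return m, b
```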

In the case of this data, I set up the script so that the slope would be about 2.2 +/- 0.5. The slope in the figure is 2.54, and the y-intercept is 18.4.

Now, we want to check how good the linear relationship is. There are several different ways of doing that. The simplest is called the correlation coefficient, or \( r \):

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

If you look at this, it's really a check of how well the variation in the measured values matches the variation expected according to the regression. On the top, you've got the sum of the products of the deviations; on the bottom, you've got the square root of the product of the summed squares of those same deviations. The bottom is, essentially, the same quantity with the signs stripped away. The end result is that if the correlation is perfect – that is, if the dependent variable increases linearly with the independent – then the correlation will be 1. If the dependent variable decreases linearly as the independent increases, then the correlation will be -1. If there's no relationship, then the correlation will be 0.

For this particular set of data, I generated it from a linear equation with a little bit of random noise. The correlation coefficient is slightly greater than 0.95, which is exactly what you'd expect.
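Computing \( r \) is just as mechanical as computing the fit. Something like this sketch (again, an illustration of the formula, with data roughly like my example rather than the actual dataset):

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = (sum((x - x_mean) ** 2 for x in xs)
           * sum((y - y_mean) ** 2 for y in ys)) ** 0.5
    return num / den

# Noisy linear data, similar in spirit to the example above:
xs = list(range(100))
ys = [2.2 * x + 18 + random.gauss(0, 15) for x in xs]
print(correlation(xs, ys))  # strongly linear data gives r close to 1
```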

Ok, so that’s the basics of linear regression. Let’s get back to the bozo-brained article that started this.

They featured this graph:

[Figure: the article's scatter plot, with the regression line they fit to it]

You can see the scatter-plot of the points, and you can see the line that was fit to the points by linear regression. How does that fit look to you? I don’t have access to the original dataset, so I can’t check it, but I’m guessing that the correlation there is somewhere around 0.1 or 0.2 – also known as “no correlation”.

You see, the author fell into one of the classic traps of linear regression. Look back at the top of this article, where I started explaining it. I said that if you had reason to believe in a linear relationship, then you could try to find it. That’s the huge catch to linear regression: no matter what data you put in, you’ll always get a “best match” line out. If the dependent and independent variables don’t have a linear relation – or don’t have any actual relation at all – then the “best match” fit that you get back as a result is garbage.

That’s what the graph above shows: you’ve got a collection of data points that to all appearances has no linear relationship – and probably no direct relationship at all. The author is interpreting the fact that linear regression gave him an answer with a positive slope as if that positive slope is meaningful. But it’s only meaningful if there’s actually a relationship present.

But when you look at the data, you don’t see a linear relationship. You see what looks like a pretty random scatterplot. Without knowing the correlation coefficient, we don’t know for sure, but that line doesn’t look to me like a particularly good fit. And since the author doesn’t give us any evidence beyond the existence of that line to believe in the relationship that they’re arguing for, we really have no reason to believe them. All they’ve done is demonstrate that they don’t understand the math that they’re using.
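To see just how easy it is to fall into that trap, try feeding completely unrelated random data through the same machinery – here's a sketch, reusing the least_squares_fit and correlation functions from above:

```python
import random

# Two variables with no relationship to each other at all.
xs = [random.uniform(0, 100) for _ in range(200)]
ys = [random.uniform(0, 100) for _ in range(200)]

m, b = least_squares_fit(xs, ys)
r = correlation(xs, ys)
print(f"slope={m:.3f}, intercept={b:.3f}, r={r:.3f}")
# Least squares always hands back *some* line, but r will hover near 0 -
# without an underlying relationship, the "best fit" line means nothing.
```

A line, by itself, proves nothing; the correlation coefficient is what tells you whether that line is worth believing.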

