Statistical pitfalls in data science


How stereotypical results can alter data distributions in people’s minds

Jun 1 · 7 min read

There are plenty of ways to infer a large and varied set of results from a given dataset, but there are infinitely many ways to reason incorrectly from it as well. Fallacies are the products of inaccurate or faulty reasoning, and they usually lead to incorrect conclusions being drawn from the given data.

Photo by Tayla Jeffs on Unsplash

The good thing is that, since people have been making these mistakes for so long and the results have been documented across a variety of fields, many of these statistical fallacies are now easy to identify and explain. Here are some statistical traps that data scientists should avoid falling into.

Cherry Picking

This is probably the most obvious and simplistic fallacy there is, and something most of us have done at some point. The intuition behind cherry picking is as simple as it gets: intentionally selecting data points that support a particular hypothesis, at the expense of the data points that reject it.

Cherry picking reduces the credibility of experimental findings because it shows only one side of the picture

Cherry picking is not only dishonest and misleading to the public; it also reduces the credibility of experimental findings, because it shows only one side of the picture and hides all the negative aspects. This can make an experiment seem entirely successful when in reality it isn’t.

Cherry Picked Data vs All Data — Source

The easiest way to avoid cherry picking is simply not to do it: cherry picking is by nature a deliberate act by the practitioner, not an accident. To further reduce the risk while collating data, one should draw data from a large and varied set of backgrounds (wherever possible) to limit the bias that comes with a narrow perspective. The sketch below shows how dramatic the effect can be.
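Here is a minimal sketch on synthetic data (the series, window size, and seed are arbitrary choices for illustration): the trend fitted on the full series is negative, but a conveniently chosen window can tell the opposite story.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic series with a genuine downward trend plus noise.
x = np.arange(100)
y = -0.05 * x + rng.normal(0, 2, size=100)

full_slope = np.polyfit(x, y, 1)[0]  # slope fitted on ALL the data

# "Cherry pick": scan 15-point windows and keep the most flattering one.
window = 15
slopes = [np.polyfit(x[i:i + window], y[i:i + window], 1)[0]
          for i in range(len(x) - window)]

print(f"slope on all data:             {full_slope:+.3f}")   # negative
print(f"slope on cherry-picked window: {max(slopes):+.3f}")  # almost surely positive
```

Nothing about the underlying data changed between the two print statements; only the selection did.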

Data Dredging

Most people (especially those unfamiliar with the nuances of data science) assume that data analysis means picking out obvious correlations from a variety of data. That is not entirely correct, as data analysis often requires logical reasoning to explain why a certain correlation exists; without a proper explanation, there remains the possibility of a chance correlation. The traditional way to proceed with an experiment is to define a hypothesis first and then examine the data to test it. Data dredging, in contrast, is the practice of mining the data for chance correlations that fit a hypothesis, without offering any logical insight into the reasons behind the correlation.

Data dredging is sometimes described as seeking more information from a dataset than it actually contains
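A small sketch makes the danger concrete (everything below is pure noise, and 0.05 is just the conventional significance threshold): dredge through enough unrelated variables and, on average, about one in twenty will look “significant” by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

outcome = rng.normal(size=50)          # pure noise
features = rng.normal(size=(200, 50))  # 200 more unrelated noise variables

# Dredge: correlate every feature with the outcome and count the "hits".
p_values = [stats.pearsonr(feature, outcome)[1] for feature in features]
hits = sum(p < 0.05 for p in p_values)

print(f"{hits} of 200 pure-noise features correlate 'significantly' (p < 0.05)")
```

Reporting only those hits, without mentioning the 200 tests behind them, is exactly what data dredging looks like in practice.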

An offshoot of data dredging is false causality, where a wrong assumption about a correlation can lead to the eventual failure of the research. Correlations between two things often tempt us to believe that one caused the other. However, it is usually a coincidence, or an external factor, that causes one or both of the effects to occur. A data scientist must always dig deeper than what appears on the surface and go beyond simple correlations to gather evidence that backs the research hypothesis.

Correlation does not imply causation

Overfitting

Overfitting is a term that most machine learning and data science practitioners are well versed in. It refers to building an excessively complex model that is so closely tailored to the training dataset that it fails to perform well on new, unseen data.

In machine learning terms, overfitting occurs when a model performs exceedingly well on the training set but fails to give similar results on the test set. John Langford gives a comprehensive description of the most commonly occurring types of overfitting in practice, and techniques to help avoid them, here.

Overfitting of data — Source

Most data scientists build mathematical models to understand the underlying relations and correlations between data points. A sufficiently complex model will tend to fit the provided data perfectly, giving high accuracy and minimal loss. That said, complex models are usually brittle and break down when presented with other data. Simple models tend to be more robust and better at making predictions on new data.
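The classic demonstration is fitting polynomials of increasing degree to a handful of noisy points (a sketch with arbitrary data; the degree-15 polynomial on 20 points stands in for the “extremely complex model”): training error keeps falling while test error blows up.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sine(n):
    """Noisy samples of an underlying sine curve."""
    x = np.sort(rng.uniform(0, 3, n))
    return x, np.sin(x) + rng.normal(0, 0.2, n)

x_train, y_train = noisy_sine(20)
x_test, y_test = noisy_sine(20)

for degree in (1, 3, 15):
    # (np.polyfit may warn about poor conditioning at degree 15;
    # that is part of the point.)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-15 fit nearly interpolates the training points, yet the moderate degree-3 model is the one that generalizes.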

Simpson’s Paradox

Simpson’s Paradox is a perfect example of why good intuition about the real world matters when collating and experimenting on data. Data scientists need to recognize and accept that most data is a finite representation of a much larger and more complex domain. Simpson’s Paradox showcases the dangers of oversimplifying a complex situation by trying to see it from a single point of view.

Simpson’s Paradox is named after Edward Hugh Simpson, the statistician who described the phenomenon in a technical paper in 1951. It is simple to state, yet often a cause of confusion for non-statistically trained audiences: a trend or result that is present when the data is put into groups reverses or disappears when the data is combined.

The overall trend reverses when data is grouped by particular categories — Source

Simpson’s Paradox is best explained with a simple example. Suppose we compare the batting records of two batsmen in cricket, A and B. A may average more runs than B in every individual season, and yet B can end up with the higher overall average when all seasons are combined, because the two batsmen’s innings are distributed very unevenly across those seasons.
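The reversal is easy to reproduce with made-up numbers (everything below is invented purely to exhibit the paradox): A out-averages B within each season, yet B out-averages A when the seasons are pooled.

```python
import pandas as pd

# Invented innings-level scores, arranged so the paradox appears:
# A plays mostly in a low-scoring season, B mostly in a high-scoring one.
df = pd.DataFrame({
    "batsman": ["A"] * 12 + ["B"] * 12,
    "season":  ["2019"] * 2 + ["2020"] * 10 + ["2019"] * 10 + ["2020"] * 2,
    "runs":    [60, 70] + [25] * 10 + [50] * 10 + [15, 25],
})

print(df.groupby(["season", "batsman"])["runs"].mean())  # A leads in BOTH seasons
print(df.groupby("batsman")["runs"].mean())              # ...yet B leads overall
```

Here the season is the lurking variable: ignore it and the combined averages tell the opposite story from every individual season.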

Simpson’s Paradox can, in some ways, be thought of as unintentional cherry picking. It is usually caused by a lurking variable: a variable within the distribution that splits the data into multiple separate distributions and that can often be difficult to identify.

We need to know what we are looking for, and to appropriately choose the best data-viewpoint that gives the audience a fair and complete representation of the truth

To avoid falling into Simpson’s Paradox, a data scientist must know their data and have a basic idea of the general factors that surround and affect it. Given those circumstances, the data should be collected and viewed in such a way that the results do not merely glorify the hypothesis (cherry picking), and do not change when viewed from a different standpoint.

Survivorship Bias

Algorithmic bias has recently gained a lot of attention and become a hot topic. Statistical bias, however, is as old as statistics itself. Survivorship bias can best be described as drawing conclusions from incomplete data, and it plays a crucial role in making data analysis inaccurate.

Survivorship bias occurs when the data in a dataset has already been subjected to a filtering process. This leads to faulty deductions and can distort a great deal of analysis. Awareness of such biases is crucial in data science, because it is a human tendency to study successful outcomes and draw inferences from them while ignoring the accompanying failures.

By looking at just the successful CEOs, we don’t see the full data set, including unsuccessful CEOs and everyone else on the planet that may happen to eat oatmeal for breakfast

Since survivorship bias stems from incomplete datasets and research inputs, there are techniques a data scientist can apply to avoid it when drawing deductions from data. These include, but are not limited to, using multiple data inputs, considering imaginary scenarios, building a contextual understanding of the data, and testing on more data.
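A toy simulation shows how strong the distortion can be (the fund setting, the return distribution, and the survival cut-off are all invented for illustration): once failures are filtered out of the dataset, the “average” that remains is far rosier than the truth.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical yearly returns of 1,000 funds; the true mean is 0%.
returns = rng.normal(loc=0.0, scale=0.15, size=1000)

# Survivorship filter: funds losing more than 10% shut down and
# silently drop out of the dataset we get to analyze.
survivors = returns[returns > -0.10]

print(f"true average return:  {returns.mean():+.2%}")
print(f"survivors' average:   {survivors.mean():+.2%}")
print(f"funds filtered out:   {returns.size - survivors.size}")
```

The analyst who only ever sees the surviving funds has no way to notice the gap unless they ask where the missing rows went.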

Gambler’s Fallacy

The gambler’s fallacy is another example of how the human mind tends to draw inferences from stereotypical correlations in the data. The gambler’s fallacy states that because something has recently occurred more frequently, it is now less likely to occur (and vice versa). However, this does not hold true in real life. For example, if a coin lands on heads three times in a row, one might think that there is no way the coin lands on heads four times in a row. This is wrong: the coin is still equally likely to land on heads or tails on the next flip.
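This is easy to check by simulation (a sketch with a fair coin; the streak length of three matches the example above): conditioning on three heads in a row barely moves the probability of heads on the next flip.

```python
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=1_000_000)  # 1 = heads, 0 = tails

# Collect the flip that follows every run of three consecutive heads.
next_flips = [flips[i + 3] for i in range(len(flips) - 3)
              if flips[i] == flips[i + 1] == flips[i + 2] == 1]

print(f"P(heads | three heads in a row) = {np.mean(next_flips):.3f}")  # about 0.5
```

Each flip is independent, so the streak carries no information about what comes next.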

The same thing happens with data. When multiple data points begin to show similar unaccounted-for correlations or contradictions, data scientists often go by gut feeling rather than logical explanations and formulas, which can lead to disastrous conclusions.

People tend to go with a gut feeling based on previous experiences rather than logical explanations when drawing inferences from data

Understanding the gambler’s fallacy requires two key ideas: the law of large numbers and its relation to regression towards the mean. The law of large numbers states that the mean of the results of performing the same experiment a large number of times should be close to the expected value, and that the gap between the sample mean and the expected value tends to shrink as the number of trials grows.
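A few lines demonstrate the law for a fair coin (the expected value is 0.5; the trial counts are arbitrary): the running mean wanders early on but settles towards 0.5 as trials accumulate.

```python
import numpy as np

rng = np.random.default_rng(5)
flips = rng.integers(0, 2, size=100_000)  # fair coin, expected value 0.5

# The sample mean drifts towards the expected value as trials grow.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"mean after {n:>7,} flips: {flips[:n].mean():.4f}")
```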

The concept of regression towards the mean also gives rise to the regression fallacy: when something unusually good or bad happens, it will tend to revert towards the average over time, and the fallacy lies in inventing a cause for that natural reversion. It is often used to “explain” the outliers generated in the predictions of a study or model.
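A quick simulation separates the fallacy from the genuine statistical effect (a sketch where scores are pure luck, so the reversion is total; real performance usually mixes skill and luck): the top performers of round one look merely average in round two, with no intervention needed to “explain” it.

```python
import numpy as np

rng = np.random.default_rng(9)

# Two independent rounds of scores driven entirely by luck.
round1 = rng.normal(100, 15, size=10_000)
round2 = rng.normal(100, 15, size=10_000)

stars = round1 > np.quantile(round1, 0.99)  # top 1% of round one

print(f"stars' round-1 mean: {round1[stars].mean():.1f}")  # far above 100
print(f"stars' round-2 mean: {round2[stars].mean():.1f}")  # back near 100
```

Attributing the round-two drop to anything other than chance, say a new coach or a policy change, is the regression fallacy at work.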

