Beyond the mean, median, and mode

Category: IT Technology · Published: 4 years ago


[Thanks to Drake Thomas and Mike Winston for discussion.]

In third grade math class, my teacher Ms. Potter taught my class about the mean, median, and mode of a list of numbers. What united these numbers, Ms. Potter told us, was that they were measures of central tendency: numbers that represented, in some sense, the “middle” of the data.

I remember being dissatisfied with the mode being labeled a measure of central tendency. After all, it’s really easy to construct an example (1, 1, 2, 3, 4, …, 100) where the mode is nowhere close to the middle of the data. I don’t remember whether I voiced this objection to Ms. Potter, but my mental model of her would have responded with “Well, for typical lists of numbers that occur in real life, the mode generally is close to the middle” — or anyway, that’s the half-satisfying explanation I ended up giving myself.

But as pointed out by Buck Shlegeris, there is some really cool math connecting the mean, median, and mode, and the mode really does deserve its place as a measure of central tendency.

To re-explain Buck’s post: the median of a list of numbers is the number that minimizes the sum of the distances to the numbers in the list. Try thinking about why this is: take some list of numbers (of odd length, for simplicity), say 0, 1, 5, 20, 23, and think about how the function $f(x)$, defined as the sum of the distances from $x$ to each number, i.e. $f(x) = |x - 0| + |x - 1| + |x - 5| + |x - 20| + |x - 23|$, varies with $x$. I think you’ll be able to convince yourself that $f(x)$ is minimized when $x = 5$, and that this is the case precisely because 5 is the median number in the list.

(For an even number of elements, there is a tie for this minimum value among all numbers in between the two middle numbers, inclusive. So for the list 0, 1, 5, 8, 20, 23, every number in the interval $[5, 8]$ could be considered the median, per this new interpretation of the median.)
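If you would rather check this numerically than by hand, here is a minimal sketch. The function name `sum_of_distances` and the half-integer grid are my own choices for illustration, and a grid search is a sanity check rather than a proof.

```python
def sum_of_distances(x, data):
    """Sum of absolute distances from x to every number in data."""
    return sum(abs(x - a) for a in data)


odd_data = [0, 1, 5, 20, 23]

# Candidate values of x: 0, 0.5, 1, ..., 23 (a coarse grid, not a proof).
grid = [i / 2 for i in range(0, 47)]
best_x = min(grid, key=lambda x: sum_of_distances(x, odd_data))
print(best_x, sum_of_distances(best_x, odd_data))

# Even-length list: every point between the two middle numbers ties.
even_data = [0, 1, 5, 8, 20, 23]
for x in (5, 6.5, 8):
    print(x, sum_of_distances(x, even_data))
```

For the odd-length list the grid minimum lands on the median, and for the even-length list the three sampled points in the middle interval all give the same total.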

What about the mean? It turns out that the mean of a list is the number that minimizes the sum of the squares of the distances to the numbers in the list. One way to see this: say your list of numbers is $a_1, \dots, a_n$. We are looking for the minimum of $f(x) = \sum_i (x - a_i)^2$. Taking the derivative with respect to $x$, we find that $f(x)$ is minimized when $\sum_i 2(x - a_i) = 0$, which is true precisely when $x$ is the mean.
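Written out step by step, with the list denoted $a_1, \dots, a_n$ and the sum of squared distances denoted $f(x)$ (notation assumed here for illustration), the derivative argument looks like this:

```latex
\begin{aligned}
f(x) &= \sum_{i=1}^{n} (x - a_i)^2, \\
f'(x) &= \sum_{i=1}^{n} 2(x - a_i) = 2\Bigl(nx - \sum_{i=1}^{n} a_i\Bigr), \\
f'(x) = 0 &\iff x = \frac{1}{n}\sum_{i=1}^{n} a_i, \\
f''(x) &= 2n > 0.
\end{aligned}
```

The positive second derivative confirms that the critical point is a minimum rather than a maximum.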

And the mode? It turns out that’s just the number $x$ minimizing $\sum_i |x - a_i|^0$ for a list $a_1, \dots, a_n$ (if we use the convention that $0^0 = 0$). That’s because $\sum_i |x - a_i|^0$ is equal to $n$ minus the number of times that $x$ appears in the list.
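A small sketch of that counting argument follows. The helper name `zero_power_cost` and the sample list are assumptions made for illustration. Note that Python evaluates `0 ** 0` as 1, so the $0^0 = 0$ convention has to be applied by hand.

```python
def zero_power_cost(x, data):
    """Sum of |x - a|**0 over data, using the convention 0**0 = 0.

    Each term is 0 when x equals a and 1 otherwise, so the total is
    len(data) minus the number of times x appears in data.
    """
    return sum(0 if x == a else 1 for a in data)


data = [1, 1, 2, 3, 4]
costs = {x: zero_power_cost(x, data) for x in set(data)}
mode = min(costs, key=costs.get)
print(costs, mode)
```

The value with the lowest cost is the one that appears most often, which is the mode.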

And so if you’re mathematically inclined, you’re probably thinking, why stop there? We can define $m_p$ (the p-median) of a data set $a_1, \dots, a_n$ to be the $x$ that minimizes $\sum_i |x - a_i|^p$. The 0-median is the mode, the 1-median is the median, the 2-median is the mean… maybe other values of $p$ are interesting as well.

At this point in the post, if you’re so inclined, it would be a good time to pause and see what you can discover about p-medians for general values of $p$ (the case $0 < p < 1$ is probably most interesting) for yourself. Or if not, keep reading!

It turns out that when $0 < p < 1$, the p-median of a list is always one of the numbers in the list! I made a GeoGebra file if you’d like to play around and get some intuition for why this is true. Here’s a formal argument: for a list $a_1 \le a_2 \le \dots \le a_n$, consider

$f(x) = \sum_{i=1}^{n} |x - a_i|^p$.

Here’s a plot of $f(x)$ with the list [0, 2, 3, 5, 8, 11] for a value of $p$ in this range:

[Figure: plot of $f(x)$ for the list [0, 2, 3, 5, 8, 11]]

We have

$f''(x) = p(p - 1) \sum_{i=1}^{n} |x - a_i|^{p - 2}$

which is negative on each interval $(a_i, a_{i+1})$, meaning that $f(x)$ is concave on each such interval (as you can see in the picture). A concave function on an interval attains its minimum at an endpoint, and outside the range of the data $f(x)$ only grows as $x$ moves away from the list. Therefore the only possible minima of $f(x)$ are at the points $a_i$.
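Here is a rough numerical check of that argument, under assumptions of my own: the name `cost`, the choice $p = 0.5$, and the grid spacing are all arbitrary. It compares the best value found on a fine grid with the best value found among the list elements, and checks the concavity inequality at the midpoint of each gap.

```python
def cost(x, data, p):
    """Sum of |x - a|**p over the list (p > 0)."""
    return sum(abs(x - a) ** p for a in data)


data = [0, 2, 3, 5, 8, 11]
p = 0.5  # an arbitrary value in (0, 1)

# Fine grid from -2.00 to 13.00; it contains every list element exactly.
grid = [i / 100 for i in range(-200, 1301)]
best_on_grid = min(cost(x, data, p) for x in grid)
best_on_list = min(cost(a, data, p) for a in data)
print(best_on_grid, best_on_list)

# Concavity between neighbors: the midpoint value is at least the
# average of the two endpoint values.
for left, right in zip(data, data[1:]):
    mid = (left + right) / 2
    assert cost(mid, data, p) >= (cost(left, data, p) + cost(right, data, p)) / 2
```

If the claim holds, no grid point beats the best list element, so the two printed values agree.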

So, given a list, which values are the p-median for some $p \in [0, 1]$? Here’s an instructive example: consider the list

0, 1, 1, 3, 4, 5, 50, 60, 61, 62, 70, 80, 90.

As p decreases from 1 toward 0, the p-median moves through the following values, in order:

  • 50 (the median), for p closest to 1.
  • 60.
  • 61.
  • 4.
  • 3.
  • 1 (the mode), for p closest to 0.
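To see where the p-median changes for this list, one option is to scan p downward and report each change. This is only a sketch: the function names and the step size of 0.01 are my own assumptions, and the search is restricted to the list's own elements, which is justified for $0 < p \le 1$ by the concavity argument above.

```python
def p_cost(x, data, p):
    """Sum of |x - a|**p over the list (p > 0, so a zero distance adds 0)."""
    return sum(abs(x - a) ** p for a in data)


def p_median_on_list(data, p):
    """p-median for 0 < p <= 1, searching only the distinct list elements."""
    return min(sorted(set(data)), key=lambda x: p_cost(x, data, p))


data = [0, 1, 1, 3, 4, 5, 50, 60, 61, 62, 70, 80, 90]

previous = None
for k in range(100, 0, -1):  # p = 1.00, 0.99, ..., 0.01
    p = k / 100
    current = p_median_on_list(data, p)
    if current != previous:
        print(f"p = {p:.2f}: p-median is {current}")
        previous = current
```

The printed switch points are only accurate to the step size; a finer grid or a bisection between consecutive steps would sharpen them.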

The key intuition here is that for p close to 1, the p-median is near the middle of the list. As p gets smaller, being close to other elements matters more and more. This makes sense, because the 0-median is the mode, i.e. the number in the list with the greatest number of other list members that are exactly equal to it.

In this example, several elements of the list are the p-median for some p; but it is no coincidence that 0 and 90 never are. Assuming that the smallest and largest elements of the list each appear only once, they can never be the p-median. To see that, take the smallest and second-smallest elements, and consider the distances from them to every other element in the list. In our earlier example [0, 1, 5, 20, 23], the distances from 0 to the other elements are 1, 5, 20, 23, and the distances from 1 to the other elements are 1, 4, 19, 22. Comparing these distances one by one, we find that the distances from the second-smallest element are never larger than the corresponding distances from the smallest element, and are strictly smaller for every element other than these two. This means that for any p > 0, the sum of the distances raised to the p-th power is smaller for the second-smallest element than for the smallest. This reasoning also works for the largest and second-largest elements.
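The distance-by-distance comparison can be spelled out in a few lines. This sketch assumes, as the text does, that the smallest element appears only once; the variable names and the sample values of p are mine.

```python
data = sorted([0, 1, 5, 20, 23])
smallest, second = data[0], data[1]

# Distances to every other element, listed in the same order for both.
from_smallest = [abs(smallest - a) for a in data if a != smallest]  # [1, 5, 20, 23]
from_second = [abs(second - smallest)] + [abs(second - a) for a in data[2:]]  # [1, 4, 19, 22]

# Term by term, the second-smallest element is never farther away.
never_larger = all(d2 <= d1 for d1, d2 in zip(from_smallest, from_second))
print(from_smallest, from_second, never_larger)

# So its total cost is lower for each positive p tried here.
for p in (0.1, 0.5, 1, 2):
    cost_smallest = sum(d ** p for d in from_smallest)
    cost_second = sum(d ** p for d in from_second)
    print(p, cost_smallest, cost_second)
```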

Is it possible to have a list of numbers of arbitrary length such that every element besides the two extreme ones serves as the p-median for some $p$ between 0 and 1? This is a fun exercise; see footnote 1 for the answer.

Another question is: what’s the behavior of the p-median in the limit as p approaches 0? This is a cool question because it’s a natural way to define the mode of a list of distinct numbers. Recall that for $p$ close to zero, we have $d^p \approx 1 + p \ln d$ for any fixed $d > 0$. Writing the list as $a_1, \dots, a_n$, this lets us write

$\sum_{i : a_i \neq x} |x - a_i|^p \approx (n - 1) + p \sum_{i : a_i \neq x} \ln |x - a_i|$

for p close to 0 and $x$ an element of the list, which means that the limiting value of the p-median is the element of the list that minimizes the sum of the logs of the distances to the other elements. Equivalently, the mode of a list of distinct numbers is the number the product of whose distances to the other numbers is as small as possible. So for instance, the mode of the list 0, 1, 2, 3, 4, 5, 10, 11, 12, 13 is 3. It’s pretty cool that the mode generalizes so naturally!
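The product-of-distances rule is easy to try out. The sketch below assumes a list of distinct numbers; the function name `limit_mode` is made up for illustration, and `math.prod` requires Python 3.8 or later.

```python
import math


def limit_mode(data):
    """Element whose product of distances to the other elements is smallest.

    Assumes the numbers in data are distinct.
    """
    def product_of_distances(x):
        return math.prod(abs(x - a) for a in data if a != x)

    return min(data, key=product_of_distances)


data = [0, 1, 2, 3, 4, 5, 10, 11, 12, 13]
print(limit_mode(data))  # prints 3
```

For this list the product of distances from 3 is 60480, smaller than the product for any other element (4 comes second at 72576).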

This leads us to an interesting connection of the p-median with the generalized mean. The p-power mean of a list of nonnegative numbers $x_1, \dots, x_n$ is defined to be

$\left( \frac{1}{n} \sum_{i=1}^{n} x_i^p \right)^{1/p}$.

Familiar cases are the arithmetic mean ($p = 1$), the quadratic mean/root mean square ($p = 2$), the harmonic mean ($p = -1$), and, interestingly, the geometric mean (the limit as p approaches 0).
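As a quick illustration of these special cases, here is a sketch of the power mean. The function name `power_mean` is my own, and treating $p = 0$ as the geometric mean is the limiting convention described above (it needs strictly positive inputs).

```python
import math


def power_mean(values, p):
    """p-power mean of a list of positive numbers."""
    n = len(values)
    if p == 0:
        # Geometric mean, i.e. the limit of the power mean as p -> 0.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1 / p)


values = [1, 2, 4]
print(power_mean(values, 1))     # arithmetic mean
print(power_mean(values, 2))     # quadratic mean (root mean square)
print(power_mean(values, -1))    # harmonic mean
print(power_mean(values, 0))     # geometric mean
print(power_mean(values, 1e-6))  # close to the geometric mean
```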

The median (1-median) of a list of numbers is the number that minimizes the arithmetic mean (1-power mean) of the distances to the numbers in the list. The (arithmetic) mean (2-median) is the number that minimizes the quadratic mean (2-power mean) of the distances. The mode (0-median) minimizes the geometric mean (0-power mean) of the distances (see footnote 2). More generally, perhaps a more natural definition of the p-median is the number that minimizes the p-power mean of distances to the numbers in the list.

Indeed, this is the definition that extends naturally to negative values of p, allowing us to define the -1-median as the number maximizing the sum of the reciprocals of the distances to all other numbers.

What about the limit as p approaches infinity? As p grows to infinity, the p-median becomes the average of the smallest and largest numbers in the list, because large distances are punished more and more relative to smaller ones.

And as p approaches negative infinity? There, having small distances to one’s neighbors is increasingly rewarded, so the p-median becomes the number in the list with the smallest distance to its closest neighbor. (Well, there are two such numbers; it’s the one whose distance to its other neighbor is smaller.)
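Both limits can be probed numerically. The sketch below rests on assumptions of my own: for $p \ge 1$ the cost is convex, so a ternary search over real $x$ is used, and for negative p the search is restricted to list elements of a list of distinct numbers, scoring each by the sum of $d^p$ over its distances to the other elements (maximizing that sum is the same as minimizing the p-power mean when $p < 0$). The function names and sample lists are hypothetical.

```python
def p_cost(x, data, p):
    """Sum of |x - a|**p over the list."""
    return sum(abs(x - a) ** p for a in data)


def p_median_convex(data, p, iterations=200):
    """Minimize p_cost over real x for p >= 1 by ternary search."""
    lo, hi = min(data), max(data)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if p_cost(m1, data, p) <= p_cost(m2, data, p):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def p_median_negative(data, p):
    """For p < 0 and distinct data: list element maximizing the sum of d**p."""
    def score(x):
        return sum(abs(x - a) ** p for a in data if a != x)

    return max(data, key=score)


data = [0, 1, 5, 20, 23]
print(p_median_convex(data, 2))   # close to the mean, 9.8
print(p_median_convex(data, 50))  # close to the midrange, 11.5

spaced = [0, 3, 4, 10, 20]
print(p_median_negative(spaced, -10))  # 3: nearest neighbor at distance 1, other neighbor closer than 4's
```

Very negative values of p (say, below -30) run into floating-point trouble here, because the tie-breaking terms become too small to register next to the dominant one.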

So as p varies, we observe some interesting behaviors:

  • When p is really large, the p-median is the average of the two extreme elements.
  • As p gets closer to 2, the p-median becomes more and more equally influenced by each element of the list (in the sense that perturbing each element has the same effect on the p-median).
  • As p further decreases to 1, the p-median approaches the middle value(s) of the list.
  • As p decreases toward and beyond 0, the p-median tends toward elements that are close to other elements. In the limit as p becomes very negative, being close to your nearest neighbor is the only thing that matters.

I see the p-median, for $0 < p < 1$, as a summary of the data that may be a reasonable choice under some circumstances. Let’s say you’re at a party and want to know the population of the United States, in millions. You ask around and get some guesses:

100, 320, 320, 325, 330, 400 (median), 500, 600, 700, 1000, 1500

But something interesting is going on in the data: four of the numbers are really close together. A reasonable conclusion to draw from this data set is that four of the people you asked know the approximate answer, while seven are just guessing. If I knew nothing about the population of the United States and saw these answers, I’d guess that the population is 325 million even though the median answer was 400 million (and the mean is even larger).

This is what the p-median accomplishes for $0 < p < 1$: in its choice of a consensus, it balances clustering (detecting knowledge of the underlying matter) with moderation (looking for middle ground in a way that’s resistant to extreme answers). And indeed, as you decrease p from 1 to 0, the p-median switches from 400 to 330, then to 325, and finally to 320 (the mode).
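To locate those switches for the party guesses, a scan over p works here too. As before, this is a sketch under my own assumptions: the helper names are made up, the step size is 0.01, and only list elements are considered (valid for $0 < p \le 1$).

```python
def p_cost(x, data, p):
    """Sum of |x - a|**p over the guesses (p > 0)."""
    return sum(abs(x - a) ** p for a in data)


def p_median_on_list(data, p):
    """p-median for 0 < p <= 1, searching only the distinct guesses."""
    return min(sorted(set(data)), key=lambda x: p_cost(x, data, p))


guesses = [100, 320, 320, 325, 330, 400, 500, 600, 700, 1000, 1500]

switches = []
previous = None
for k in range(100, 0, -1):  # p = 1.00, 0.99, ..., 0.01
    p = k / 100
    current = p_median_on_list(guesses, p)
    if current != previous:
        switches.append((p, current))
        previous = current

for p, value in switches:
    print(f"from p = {p:.2f} downward: p-median is {value}")
print("0.8-median:", p_median_on_list(guesses, 0.8))
```

The last line prints the 0.8-median, which is the estimator the conjecture below is about.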

So here’s a bold conjecture: in practice, when estimating quantities in the real world from asking around, using the 0.8-median is better than using the median.

I’m very uncertain about this conjecture (I’m not sure I’d bet on it at even odds, though it’s close), but I find it very plausible, and it would be very interesting if it were true.

1. Yes, it is possible! One example is [formula missing] (for arbitrary [formula missing]). The idea is that when $p = 1$, the p-median (i.e. the median) is [formula missing]. As you decrease p, the p-median becomes more and more “mode-ish” (i.e. having lots of numbers close to yours matters more and more relative to being in the middle of the data). So as you decrease p, the p-median switches to numbers that are further and further from the median. You can easily modify this example to get lists of even length with the same property.

2. Well technically that’s 0 for every number in the list, but the limit of the p-median as you push p to zero is the number minimizing the geometric mean of the distances to all other numbers.

