Entropy: An Introduction


As the first digital computers were being invented in the 1940s and researchers like Alan Turing were formalizing their capabilities, the scientific and engineering communities needed a way to quantify what had previously been intuitive notions of information and communication. In 1948 Claude Shannon, during a stint at Bell Labs (also the birthplace of the transistor and the Unix operating system, among other major achievements in computer science), provided the first comprehensive framework to do so in his landmark paper "A Mathematical Theory of Communication". With the tools provided by this theory (now known as Information Theory), we can answer important questions about the limits of modern computers, like: How can we store information (say, a file on our hard drive) in the most efficient way? Or, how do we successfully communicate information (say, a text message) over a channel (e.g., our cell network) that may not be 100% reliable?

We’ve provided a couple examples of “information” in the abstract sense, but how can we define information in a formal, precise way? We’ll start with an example. Let’s say the city of Seattle has decided to introduce a new service, whereby every morning each resident gets a text message with the day’s weather forecast. If it’s wintertime, when the weather is overcast ~100% of the time, sending the text is redundant; a resident will always be correct if they have basic knowledge of the city’s weather patterns and assume that every day will be overcast. In other words, the forecast won’t provide any new information for city residents, since there’s no uncertainty in the outcome.

Now, suppose instead we’re in a different season with multiple possibilities for the day’s weather. In this case our service will actually be useful - residents can no longer just assume that the weather will be overcast every day and be correct 100% of the time. Just how useful is a given forecast? Suppose we’re in the summer, and the weather (optimistically) is sunny 75% of the time and overcast 25% of the time. In this case, a resident would be correct on most days if they naively assumed the day’s weather would be sunny, and a forecast of a sunny day thus doesn’t provide much new information as compared to the naive baseline. Correspondingly, due to its relative rarity, a forecast indicating an overcast day would be much more informative, as the naive prediction would be incorrect in this case.

From the previous example we see that the information transmitted when communicating the outcome of an event is intimately related to the frequency of that outcome. With this idea in mind, we introduce our first concept in information theory, self-information.

Let’s say we have some discrete random variable $X \sim p(x)$ taking on a set of potential values $x \in \mathcal{X}$ with a corresponding probability mass function $p(x)$. We then define:

The self-information (sometimes known as "surprisal") of an outcome $x \in \mathcal{X}$ is defined as $$I(x) = \log \frac{1}{p(x)} = -\log p(x)$$

We usually take the $\log$ in the above definition to be base 2, so self-information is measured in bits. At first glance our definition of $I(x)$ may seem arbitrary, but we can see that it captures the ideas we had in our thought experiment earlier, along with some other nice properties. As $p(x) \to 1$, we have that $I(x) \to 0$. This aligns with our intuition that information about an event is only transferred from a source to a recipient if the recipient was not already aware of that event’s outcome. Similarly, as $p(x) \to 0$, we have that $I(x) \to \infty$, reflecting that the communication of a rare outcome is more informative than that of a common one.
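The definition is simple enough to compute directly. Below is a minimal Python sketch (the helper name `self_information` is ours, purely for illustration) that plugs in the probabilities from the weather example above:

```python
import math

def self_information(p: float) -> float:
    """Self-information (surprisal) in bits: I(x) = -log2 p(x)."""
    return -math.log2(p)

print(self_information(1.0))   # 0.0 bits   -- a certain outcome carries no information
print(self_information(0.75))  # ~0.415 bits -- the common sunny forecast
print(self_information(0.25))  # 2.0 bits   -- the rarer overcast forecast
```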

Finally, suppose we wish to communicate the outcomes of two independent events $X_1, X_2 \stackrel{\text{i.i.d.}}{\sim} p(x)$. Since the outcomes of $X_1$ and $X_2$ are independent of each other, we’d expect that a compound message containing both outcomes would convey an amount of information equal to the sum of sending two messages containing each outcome individually. The self-information satisfies this property too, since for some joint outcome $(x_1, x_2) \in \mathcal{X} \times \mathcal{X}$ we have

$$\begin{aligned} I(x_1, x_2) &= -\log p(x_1, x_2)\\ &= -\log\big(p(x_1)p(x_2)\big)\\ &= -[\log p(x_1) + \log p(x_2)]\\ &= I(x_1) + I(x_2) \end{aligned}$$

which matches what we’d expect of a measure of information.
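To see this concretely, here is a quick numerical check in Python (a minimal sketch; the probabilities 0.75 and 0.25 are just arbitrary example values):

```python
import math

# Probabilities of two independent outcomes (arbitrary example values).
p_x1, p_x2 = 0.75, 0.25

def self_information(p: float) -> float:
    """Self-information in bits: I(x) = -log2 p(x)."""
    return -math.log2(p)

# For independent outcomes the joint probability factors, so the
# self-information of the joint outcome is the sum of the individual ones.
joint = self_information(p_x1 * p_x2)
separate = self_information(p_x1) + self_information(p_x2)
print(math.isclose(joint, separate))  # True
```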

Now, suppose our outcome isn’t known ahead of time, but we still want a sense of how much information we’ll be transmitting. One simple way to do this is to weight the self-information $I(x)$ of each outcome by the probability of that outcome $p(x)$. This gives us our second major concept in information theory, entropy:

The entropy $H(X)$ of a random variable $X$ is defined as $$H(X) = \sum_{x \in \mathcal{X}}p(x)I(x) = -\sum_{x \in \mathcal{X}}p(x)\log p(x)$$

By convention we assign a value of $0$ to $p(x)\log p(x)$ when $p(x) = 0$, even though technically $0\log 0$ is undefined. We can justify this by continuity, as it can be shown that $x \log x \to 0$ as $x \to 0^{+}$. This too aligns with our intuition: an outcome with no chance of occurring doesn’t have any effect on the variability of our event $X$.
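Translated into code, the definition (together with the $0 \log 0 = 0$ convention) looks something like the following Python sketch (the function name `entropy` is our own):

```python
import math

def entropy(pmf) -> float:
    """H(X) = -sum_x p(x) log2 p(x), in bits, skipping p(x) = 0 terms by convention."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit  -- a fair coin
print(entropy([1.0, 0.0]))  # 0.0 bits -- deterministic outcome; the p = 0 term is skipped
```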

Entropy is quite possibly the “fundamental unit” of information theory, and it’ll continue coming up in all kinds of interesting ways. We can think of $H(X)$ as quantifying how much we’ll be “surprised” by the outcome of $X$ on average. An example will help illustrate this idea:

Let $X \sim \mathrm{Bern}(p)$. In other words, we have $$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$$

Plugging in to our formula for entropy, we have

$$H(X) = -p\log p - (1 - p)\log(1-p)$$

To build up our intuition we can visualize the entropy for different values of $p$ in the following graph (where we let $H(p)$ denote $H(X)$ as a function of $p$):

Figure 1. Entropy of $X \sim \mathrm{Bern}(p)$ as a function of $p$. Note that $H(p)$ is maximized when $p = 1/2$ (maximum variability in the outcome) and minimized when $p = 0$ or $p = 1$ (no variability in the outcome).

As we would expect, $H(X)$ is 0 when $X$ is deterministic ($p = 0$ or $p = 1$), as there’s no potential for surprise in such cases. On the other hand, it’s maximized when $p = 1/2$, when the variability in outcomes is greatest. We’ll work through another, more complicated example before moving on.
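If you’d like to reproduce the curve in Figure 1 yourself, here is a small Python sketch of the binary entropy function (the name `binary_entropy` is ours):

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy of a Bern(p) random variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0  # deterministic outcomes carry no entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(binary_entropy(p), 3))
# 0.0 -> 0.0, 0.25 -> 0.811, 0.5 -> 1.0, 0.75 -> 0.811, 1.0 -> 0.0
```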

Let $X \sim \mathrm{Geom}(p)$. In other words, the probability that we need exactly $k$ Bernoulli trials to reach our first success is (letting $q = 1 - p$) $$P(X = k) = q^{k-1}p$$

Plugging into our formula for entropy we have

$$\begin{aligned} H(X) &= -\sum_{k=1}^{\infty}q^{k-1}p\cdot \log\big(q^{k-1}p\big) \\ &= -\bigg[\sum_{k=1}^{\infty}pq^{k-1}\log p + \sum_{k=1}^{\infty}(k-1)pq^{k-1}\log q\bigg]\\ &= -\bigg[\sum_{k=0}^{\infty}pq^{k}\log p + \sum_{k=0}^{\infty}kpq^{k}\log q\bigg] \end{aligned}$$

From calculus, we know that $\sum_{n=0}^{\infty}r^{n} = \frac{1}{1-r}$ and $\sum_{n=0}^{\infty}nr^{n} = \frac{r}{(1-r)^2}$ for any real number $r$ with $|r| < 1$. This gives us

$$\begin{aligned} H(X) &= -\bigg[\frac{p\log p}{1 - q} + \frac{pq\log q}{p^2}\bigg] \\ &= -\bigg[\frac{p\log p + q\log q}{p}\bigg]\\ &= -\bigg[\frac{p\log p + (1-p)\log (1-p)}{p}\bigg] \end{aligned}$$

If we let $Y \sim \mathrm{Bern}(p)$, this further simplifies to

$$H(X) = \frac{H(Y)}{p}$$

and we can visualize the entropy for different values of $p$ below:

Figure 2. Entropy of $X \sim \mathrm{Geom}(p)$ as a function of $p$. Note that $H(p) \to 0$ as $p \to 1$ and $H(p) \to \infty$ as $p \to 0$. This should agree with our intuition; with a high chance of success we'd expect our process to end quickly. On the other hand, with a lower chance of success we have a much fatter tail with higher variability in outcomes.

As we might expect, $H(p) \to 0$ as $p \to 1$, since there’s less variability in the number of trials we need until our first success. On the other hand, $H(p) \to \infty$ as $p \to 0$, which reflects the fact that it’s much harder to predict the number of trials required to reach a success as the probability of success is lowered.
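As a sanity check on the algebra above, we can compare a truncated direct evaluation of the sum against the closed form $H(Y)/p$. A rough Python sketch (the function names and the choice $p = 0.3$ are our own):

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy of a Bern(p) random variable, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def geometric_entropy_direct(p: float, terms: int = 1_000) -> float:
    """Truncated version of -sum_{k>=1} q^(k-1) p log2(q^(k-1) p)."""
    q = 1 - p
    total = 0.0
    for k in range(1, terms + 1):
        prob = q ** (k - 1) * p
        total -= prob * math.log2(prob)
    return total

p = 0.3
print(geometric_entropy_direct(p))  # ~2.938 bits (truncated sum)
print(binary_entropy(p) / p)        # ~2.938 bits (closed form H(Y)/p)
```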

One immediate consequence of the definition of entropy is the following:

$$H(X) \geq 0$$

Proof:

Going back to our definition of entropy:

$$H(X) = -\sum_{x \in \mathcal{X}}p(x)\log p(x)$$

Since our $p(x)$ terms are probabilities, we must have $0 \leq p(x) \leq 1$. Because $\log x \leq 0$ for $0 < x \leq 1$, we must have $\sum_{x \in \mathcal{X}}p(x)\log p(x) \leq 0$. Multiplying by the negative sign in the definition of $H(X)$ then completes the proof. $\square$

This fact will prove very useful as we proceed in the course, so it’s good to commit to memory now.

The reader may have noticed that our definition of entropy is nearly identical to that of the expected value of a random variable, differing only in the presence of the $\log$ function. This raises the question: can we reconcile these two concepts? First, as a reminder, we define expected value as

The expected value $E[X]$ of a discrete random variable $X \sim p(x)$ is $$E[X] = \sum_{x \in \mathcal{X}} p(x)\cdot x$$

Now, suppose we apply some deterministic function $g$ to our random variable $X$; this gives us a new random variable $g(X)$. Since the randomness in $g(X)$ is entirely due to the randomness in $X$, intuitively we should be able to compute $E[g(X)]$ using $X$’s distribution $p(x)$ without needing to compute $g(X)$’s distribution. It turns out that our intuition here is correct; this is a well-known result in probability theory that is used enough to have its own name, the Law of the Unconscious Statistician (LOTUS):

For a discrete random variable $X \sim p(x)$ and function $g\colon \mathbb{R} \to \mathbb{R}$, $$E[g(X)] = \sum_{x \in \mathcal{X}} p(x)\cdot g(x)$$

Despite the intuitive nature of this theorem, proving it requires some work, which we omit here and leave as an exercise for the reader. Setting $g(X) = -\log p(X)$ and using LOTUS, we then have

$$H(X) = E[-\log p(X)] = E[I(X)]$$

which allows us to formally justify our notion of entropy as the average amount of information contained in the outcome of a random variable.
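One way to make this concrete is to estimate $E[I(X)]$ by simulation and compare it to the entropy computed directly from the definition. A minimal Python sketch (the three-outcome distribution below is just an arbitrary example):

```python
import math
import random

# An arbitrary example distribution over three outcomes.
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}

# Entropy computed directly from the definition.
H_exact = -sum(p * math.log2(p) for p in pmf.values())

# Monte Carlo estimate of E[I(X)] = E[-log2 p(X)]: sample X ~ p(x) and
# average the self-information of the observed outcomes.
random.seed(0)
samples = random.choices(list(pmf), weights=list(pmf.values()), k=100_000)
H_estimate = sum(-math.log2(pmf[x]) for x in samples) / len(samples)

print(H_exact)     # 1.5 bits
print(H_estimate)  # close to 1.5 bits
```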

We’ll go over one final example before concluding these notes. Suppose we want to communicate the outcome of an event represented by a random variable $X$. Ideally (to save money, power, etc.) we’d like to do so with as short a message as possible on average. Naturally, our encoding scheme should use fewer bits to represent high-probability outcomes and more bits to encode low-probability ones. It turns out (we’ll prove this later) that any coding scheme for this problem requires at least $H(X) = E[I(X)]$ bits on average to send our message without corrupting its contents. In other words, the entropy provides a lower bound on the average number of bits required to guarantee that our message will be transmitted successfully. While we won’t have the tools to prove this until later, a simple example will help illustrate this idea:

Let $$X = \begin{cases} a & \text{with probability } 1/2 \\ b & \text{with probability } 1/4 \\ c & \text{with probability } 1/8 \\ d & \text{with probability } 1/8 \end{cases}$$

How could we devise a code to transmit the outcome of $X$ in this case? One possible scheme is as follows: if $X = a$, we send a 1 and end transmission; otherwise we send a 0 and continue. Similarly, if $X = b$, we then send a 1 and stop; otherwise we send another 0. Finally, if $X = c$ we send a 1 and stop; otherwise we send a 0 and stop. Under this scheme, we send a total of 1 bit half the time, 2 bits 1/4 of the time, and 3 bits 1/4 of the time, giving us a total of 1.75 bits on average.

Figure 3. Visual representation of our probability space. At each step of our encoding algorithm, we eliminate half of the space that still remains.

Each time we ask our “questions” in this scheme, we’re eliminating half of the remaining probability space. This is very similar to the way that binary search eliminates half the possible solution space at each step of the algorithm. Moreover, just as binary search is an optimal algorithm (i.e., it takes the fewest steps possible while still guaranteeing a correct answer), we would expect our encoding scheme to be optimal for $X$.

We can verify this intuition by plugging into our formula for entropy

$$H(X) = -\frac{1}{2}\log\frac{1}{2}-\frac{1}{4}\log\frac{1}{4}-\frac{1}{8}\log\frac{1}{8}-\frac{1}{8}\log\frac{1}{8} = 1.75 \text{ bits}$$

Since the average number of bits needed to encode the outcome of $X$ can’t be less than $H(X)$, our proposed scheme is indeed optimal.
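The scheme above is small enough to check end to end. The sketch below (the codeword strings are our own transcription of the scheme described in the text) computes the expected code length and compares it to $H(X)$:

```python
import math

pmf  = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "1", "b": "01", "c": "001", "d": "000"}  # the scheme described above

# Expected number of bits sent: sum over outcomes of p(x) * codeword length.
avg_length = sum(p * len(code[x]) for x, p in pmf.items())

# Entropy of X, the lower bound on the average code length.
H = -sum(p * math.log2(p) for p in pmf.values())

print(avg_length)  # 1.75
print(H)           # 1.75 -- the code meets the entropy bound, so it's optimal
```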
