As the first digital computers were being invented in the 1940s and researchers like Alan Turing were formalizing their capabilities, the scientific and engineering communities needed a way to quantify what had previously been intuitive notions of information and communication. In 1948 Claude Shannon, during a stint at Bell Labs (also the birthplace of the transistor and the Unix operating system, among other major achievements in computer science), provided the first comprehensive framework to do so in his landmark paper A Mathematical Theory of Communication. With the tools provided by this theory (now known as Information Theory), we can answer important questions about the limits of modern computers, like: How can we store information (say, a file on our hard drive) in the most efficient way? Or, how do we successfully communicate information (say, a text message) over a channel (e.g., our cell network) that may not be 100% reliable?
We've provided a couple of examples of "information" in the abstract sense, but how can we define information in a formal, precise way? We'll start with an example. Let's say the city of Seattle has decided to introduce a new service, whereby every morning each resident gets a text message with the day's weather forecast. In the wintertime, when the weather is overcast ~100% of the time, sending the text is redundant; a resident will always be correct if they have basic knowledge of the city's weather patterns and assume that every day will be overcast. In other words, the forecast won't provide any new information for city residents, since there's no uncertainty in the outcome.
Now, suppose instead we're in a different season with multiple possibilities for the day's weather. In this case our service will actually be useful: residents can no longer just assume that the weather will be overcast every day and be correct 100% of the time. Just how useful is a given forecast? Suppose we're in the summer, and the weather (optimistically) is sunny 75% of the time and overcast 25% of the time. In this case, a resident would be correct on most days if they naively assumed the day's weather would be sunny, and a forecast of a sunny day thus doesn't provide much new information as compared to the naive baseline. Correspondingly, due to their relative rarity, a forecast indicating a cloudy day would be much more informative, as the naive prediction would be incorrect in this case.
From the previous example we see that the information transmitted when communicating the outcome of an event is intimately related to the frequency of that outcome. With this idea in mind, we introduce our first concept in information theory, self-information.
Let's say we have some discrete random variable $X$ taking on a set of potential values $\mathcal{X}$ with a corresponding probability mass function $p(x)$. We then define:
The self-information (sometimes known as "surprisal") of an outcome $x \in \mathcal{X}$ is defined as
$$I(x) = -\log p(x)$$
We usually take the logarithm in the above definition to be base 2, so self-information is measured in bits. At first glance our definition of $I(x)$ may seem arbitrary, but we can see that it captures the ideas we had in our thought experiment earlier, along with some other nice properties. As $p(x) \to 1$, we have $I(x) \to 0$. This aligns with our intuition that information about an event is only transferred from a source to a recipient if the recipient was not already aware of that event's outcome. Similarly, as $p(x) \to 0$, we have $I(x) \to \infty$, reflecting that the communication of a rare outcome is more informative than a common one.
Finally, suppose we wish to communicate the outcomes of two independent random variables $X$ and $Y$. Since the outcomes of $X$ and $Y$ are independent of each other, we'd expect that a compound message containing both outcomes would convey an amount of information equal to the sum of sending two messages containing each outcome individually. The self-information satisfies this property too, since for some joint outcome $(x, y)$ we have
$$I(x, y) = -\log p(x, y) = -\log\big(p(x)\,p(y)\big) = -\log p(x) - \log p(y) = I(x) + I(y),$$
which matches what we'd expect of a measure of information.
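As a quick sanity check, here's a minimal Python sketch of these properties; the self_information helper is our own naming, not part of any particular library.

```python
import math

def self_information(p: float) -> float:
    """Self-information (surprisal) of an outcome with probability p, in bits."""
    return -math.log2(p)

# Rarer outcomes are more surprising, and thus more informative.
print(self_information(0.75))  # a sunny forecast: ~0.415 bits
print(self_information(0.25))  # an overcast forecast: 2.0 bits

# Additivity for independent outcomes: I(x, y) = I(x) + I(y).
p_x, p_y = 0.5, 0.25
print(self_information(p_x * p_y))                    # 3.0 bits
print(self_information(p_x) + self_information(p_y))  # 3.0 bits
```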
Now, suppose our outcome isn't known ahead of time, but we still want a sense of how much information we'll be transmitting. One simple way to do this is to weight the self-information of each outcome $x$ by the probability $p(x)$ of that outcome. This gives us our second major concept in information theory, entropy.
The entropy of a random variable $X$ is defined as
$$H(X) = -\sum_{x} p(x) \log p(x)$$
By convention we assign a value of $0$ to $p(x)\log p(x)$ when $p(x) = 0$, even though technically $\log 0$ is undefined. We can justify this by continuity, as it can be shown that $t\log t \to 0$ as $t \to 0^{+}$. This too aligns with our intuition: an outcome with no chance of occurring doesn't have any effect on the variability of our event $X$.
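In code, this convention amounts to simply skipping zero-probability outcomes when summing. Here's a minimal sketch (the entropy helper is our own, for illustration only):

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits, using the convention 0 * log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A zero-probability outcome contributes nothing to the entropy.
print(entropy([0.5, 0.5, 0.0]))  # 1.0 bit -- same as a fair coin
print(entropy([0.5, 0.5]))       # 1.0 bit
```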
Entropy is quite possibly the "fundamental unit" of information theory, and it'll continue coming up in all kinds of interesting ways. We can think of $H(X)$ as quantifying how much we'll be "surprised" by the outcome of $X$ on average. An example will help illustrate this idea.
Let $X \sim \mathrm{Bernoulli}(p)$. In other words, we have
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p.$$
Plugging in to our formula for entropy, we have
$$H(X) = -p\log p - (1 - p)\log(1 - p)$$
To build up our intuition we can visualize our entropy for different values of $p$ in the following graph (where we let $H(p)$ denote $H(X)$ as a function of $p$):
Figure 1. Entropy of $X$ as a function of $p$. Note that $H(p)$ is maximized when $p = 1/2$ (maximum variability in the outcome) and minimized when $p = 0$ or $p = 1$ (no variability in the outcome).
As we would expect, $H(X)$ is 0 when $X$ is deterministic ($p = 0$ or $p = 1$), as there's no potential for surprise in such cases. On the other hand, it's maximized when $p = 1/2$, where the variability in outcomes is greatest. We'll work through another, more complicated example before moving on.
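First, though, a quick numerical check of the curve above. This is a small sketch (binary_entropy is our own helper name) that evaluates $H(p)$ at a few values of $p$:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy of a Bernoulli(p) random variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"H({p}) = {binary_entropy(p):.3f}")
# The values peak at p = 0.5 (1.000 bits) and fall to 0 at p = 0 and p = 1.
```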
Let $X \sim \mathrm{Geometric}(p)$. In other words, the probability that we need $k$ Bernoulli trials before reaching our first 1 is (letting $q = 1 - p$)
$$P(X = k) = q^{\,k-1}\,p$$
Plugging into our formula for entropy we have
$$H(X) = -\sum_{k=1}^{\infty} q^{\,k-1} p \,\log\!\left(q^{\,k-1} p\right) = -\log p \sum_{k=1}^{\infty} q^{\,k-1} p \;-\; \log q \sum_{k=1}^{\infty} (k - 1)\, q^{\,k-1} p$$
From calculus, we know that $\sum_{k=0}^{\infty} x^{k} = \frac{1}{1-x}$ and $\sum_{k=0}^{\infty} k\,x^{k} = \frac{x}{(1-x)^{2}}$ for any real number $x$ with $|x| < 1$. This gives us
$$H(X) = -\log p - \frac{q}{p}\log q = \frac{-p\log p - q\log q}{p}$$
If we let $H_b(p) = -p\log p - (1-p)\log(1-p)$ denote the entropy of the Bernoulli random variable from our previous example, this further simplifies to
$$H(X) = \frac{H_b(p)}{p},$$
and we visualize our entropy for different values of $p$ below:
Figure 2. Entropy of $X$ as a function of $p$. Note that $H(X) \to 0$ as $p \to 1$ and $H(X) \to \infty$ as $p \to 0$. This should agree with our intuition; with a high chance of success we'd expect our process to end quickly. On the other hand, with a lower chance of success we have a much fatter tail with higher variability of outcomes.
As we might expect, $H(X) \to 0$ as $p \to 1$, since there's less variability in the number of trials we need until our first success. On the other hand, $H(X) \to \infty$ as $p \to 0$, which reflects the fact that it's much harder to predict the number of trials required to reach a success as the probability of success is lowered.
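As a numerical check of the closed form, the sketch below (helper names are our own) compares $H_b(p)/p$ against a truncated evaluation of the defining sum:

```python
import math

def binary_entropy(p: float) -> float:
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def geometric_entropy_closed_form(p: float) -> float:
    """H(X) for X ~ Geometric(p), via the closed form H_b(p) / p."""
    return binary_entropy(p) / p

def geometric_entropy_direct(p: float, max_k: int = 10_000) -> float:
    """Truncated evaluation of -sum_k P(X = k) * log2 P(X = k)."""
    total = 0.0
    for k in range(1, max_k + 1):
        p_k = (1 - p) ** (k - 1) * p
        if p_k > 0:  # skip terms that have underflowed to zero
            total -= p_k * math.log2(p_k)
    return total

for p in (0.1, 0.5, 0.9):
    print(p, geometric_entropy_closed_form(p), geometric_entropy_direct(p))
```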
One immediate consequence of the definition of entropy is the following: for any discrete random variable $X$, we have $H(X) \geq 0$.
Proof:
Going back to our definition of entropy:
$$H(X) = -\sum_{x} p(x) \log p(x)$$
Since our terms $p(x)$ are probabilities, we must have $0 \leq p(x) \leq 1$. Since $\log t \leq 0$ for $0 < t \leq 1$, we must have $p(x)\log p(x) \leq 0$ for every $x$. Multiplying by the negative sign in the definition of $H(X)$ then completes the proof.
This fact will prove very useful as we proceed in the course, so it’s good to commit to memory now.
The reader may have noticed that our definition of entropy is nearly identical to that of the expected value of a random variable, differing only in the presence of the $\log$ function. This raises the question: can we reconcile these two concepts? First, as a reminder, we define expected value as
The expected value of a discrete random variable $X$ is
$$E[X] = \sum_{x} x\, p(x)$$
Now, suppose we apply some deterministic function $g$ to our random variable $X$; this gives us a new random variable $Y = g(X)$. Since the randomness in $Y$ is entirely due to the randomness in $X$, intuitively we should be able to compute $E[Y]$ using $X$'s distribution without needing to compute $Y$'s distribution. It turns out that our intuition here is correct; this is a well-known result in probability theory that is used enough to have its own name, the Law of the Unconscious Statistician (LOTUS).
For a discrete random variable $X$ and function $g$,
$$E[g(X)] = \sum_{x} g(x)\, p(x)$$
Despite the intuitive nature of this theorem, proving it requires some work, which we omit here and leave as an exercise for the reader. Setting $g(x) = -\log p(x)$ and using LOTUS we then have
$$E[-\log p(X)] = -\sum_{x} p(x)\log p(x) = H(X),$$
which allows us to formally justify our notion of entropy as the average amount of information contained in the outcome of a random variable.
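One way to see this connection numerically is to sample from $X$, average the surprisal $-\log_2 p(x)$ of the samples, and compare the result to $H(X)$. The sketch below does exactly that for an arbitrary example distribution of our own choosing:

```python
import math
import random

# An arbitrary example distribution, chosen purely for illustration.
probs = {"a": 0.6, "b": 0.3, "c": 0.1}

entropy = -sum(p * math.log2(p) for p in probs.values())

# Estimate E[-log2 p(X)] by averaging the surprisal over samples of X.
random.seed(0)
samples = random.choices(list(probs), weights=list(probs.values()), k=100_000)
avg_surprisal = sum(-math.log2(probs[x]) for x in samples) / len(samples)

print(entropy)        # ~1.295 bits
print(avg_surprisal)  # should be close to the entropy
```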
We'll go over one final example before concluding these notes. Suppose we want to communicate the outcome of an event represented by a random variable $X$. Ideally (to save money, power, etc.) we'd like to do so with as short a message as possible on average. Naturally, our encoding scheme should use fewer bits to represent high-probability outcomes and more bits to encode low-probability ones. It turns out (we'll prove this later) that any coding scheme for this problem requires at least $H(X)$ bits on average to send our message without corrupting its contents. In other words, $H(X)$ provides a lower bound on the average number of bits required to guarantee that our message will be transmitted successfully. While we won't have the tools to prove this until later, a simple example will help illustrate this idea.
Let $X$ take on one of the four values $x_1, x_2, x_3, x_4$ with
$$P(X = x_1) = \frac{1}{2}, \quad P(X = x_2) = \frac{1}{4}, \quad P(X = x_3) = P(X = x_4) = \frac{1}{8}.$$
How could we devise a code to transmit the outcome of $X$ in this case? One possible scheme is as follows: If $X = x_1$, we send a 1 and end transmission, otherwise we send a 0 and continue. Similarly, if $X = x_2$, we then send a 1 and stop, otherwise we send another 0. Finally, if $X = x_3$ we send a 1 and stop, otherwise we send a 0 and stop. Under this scheme, we send a total of 1 bit half the time, 2 bits 1/4 of the time, and 3 bits 1/4 of the time, giving us a total of 1.75 bits on average.
Figure 3. Visual representation of our probability space. At each step of our encoding algorithm, we eliminate half of the space that still remains.
Each time we ask our "questions" in this scheme, we're eliminating half of the remaining probability space. This is very similar to the way that binary search eliminates half of the possible solution space at each step of the algorithm. Moreover, just as binary search is an optimal algorithm (i.e., it takes the fewest number of steps possible to still guarantee a correct answer), we would expect our encoding scheme to be optimal for $X$.
We can verify this intuition by plugging into our formula for entropy:
$$H(X) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{8}\log 8 + \frac{1}{8}\log 8 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{3}{8} = 1.75 \text{ bits}$$
Since the average number of bits needed to encode the outcome of $X$ can't be less than $H(X) = 1.75$, our proposed scheme is indeed optimal.
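To double-check this empirically, here's a short simulation sketch: it writes out the codewords implied by the scheme above, samples outcomes of $X$, and compares the average code length to the entropy.

```python
import math
import random

# Distribution and codewords from the scheme described above.
probs = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
code = {"x1": "1", "x2": "01", "x3": "001", "x4": "000"}

entropy = -sum(p * math.log2(p) for p in probs.values())

random.seed(0)
samples = random.choices(list(probs), weights=list(probs.values()), k=100_000)
avg_length = sum(len(code[x]) for x in samples) / len(samples)

print(entropy)     # 1.75 bits
print(avg_length)  # close to 1.75 bits per outcome
```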