Knowing known unknowns with deep neural networks


Defining and quantifying aleatory and epistemic uncertainty in deep neural networks

Deep neural networks (DNNs) are easy-to-implement, versatile machine learning models that can achieve state-of-the-art performance in many domains (for example, computer vision, natural language processing, speech recognition, and recommendation systems). DNNs, however, are not perfect. You can read any number of articles, blog posts, and books discussing the various problems with supervised deep learning. In this article we’ll focus on a (relatively) narrow but major issue: the inability of a standard DNN to reliably signal when it is uncertain about a prediction. For a Rumsfeldian take on it: the inability of DNNs to know “known unknowns.”

As a simple example of this failure mode in DNNs, consider training a DNN for a binary classification task. You might reasonably presume that the softmax (or sigmoid) output of a DNN could be used to measure how certain or uncertain the DNN is in its prediction; you would expect that seeing a softmax output close to 0 or 1 would indicate certainty, and an output close to 0.5 would indicate uncertainty. In reality, the softmax outputs are rarely close to 0.5 and are, more often than not, close to 0 or 1 regardless of whether the DNN is making a correct prediction. Unfortunately, this makes naive uncertainty estimates (for instance, the entropy of the softmax outputs) unreliable.
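To make the naive estimate concrete, here is a minimal sketch in PyTorch (the function name is ours, for illustration) of the entropy-based uncertainty estimate described above:

```python
import torch
import torch.nn.functional as F

def predictive_entropy(logits):
    """Entropy of the softmax distribution: the naive uncertainty
    estimate discussed above. High entropy should mean "uncertain,"
    but saturated softmax outputs make this unreliable in practice."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

# A saturated binary prediction yields near-zero entropy (apparent
# certainty) whether or not the prediction is actually correct.
logits = torch.tensor([[4.0, -4.0]])  # softmax ≈ [0.9997, 0.0003]
print(predictive_entropy(logits))     # ≈ 0.003
```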

To be fair, uncertainty estimates are not needed for every application of a DNN. If a social media company uses a DNN to detect faces in images so that its users can more easily tag their friends, and the DNN fails, then the failure of the method is nearly inconsequential. A user might be slightly inconvenienced, but in low-stakes environments like social media or advertising, uncertainty estimates aren’t vital to creating value from a DNN.

In high-stakes environments, however, like self-driving cars, health care, or military applications, a measure of how uncertain the DNN is in its prediction could be vital. Uncertainty measurements can reduce the risk of deploying a model because they can alert a user to the fact that a scenario is either inherently difficult to make predictions in or unlike anything the model has seen before.

In a self-driving car, it seems plausible that a DNN should be more uncertain about predictions at night (at least in the measurements coming from optical cameras) because of the lower signal-to-noise ratio. In health care, a DNN that diagnoses skin cancer should be more uncertain if it were shown a particularly blurry image, especially if the model had not seen such blurry examples in the training set. In a model to segment satellite imagery, a DNN should be more uncertain if an adversary changed how they disguise certain military installations. If the uncertainty inherent in these situations were relayed to the user, the information could be used to change the behavior of the system in a safer way.

In this article, we explore how to estimate two types of statistical uncertainty alongside a prediction in a DNN. We first discuss the definition of both types of uncertainty, and then we highlight one popular and easy-to-implement technique to estimate these types of uncertainty. Finally, we show and implement some examples for both classification and regression that make use of these uncertainty estimates.

For those who are most interested in looking at code examples, here are two Jupyter notebooks: one with a toy regression example and the other with a toy classification example. There are also PyTorch-based code snippets in the “Examples and Applications” section below.

What do we mean by ‘uncertainty’?

Uncertainty is defined by the Cambridge Dictionary as: “a situation in which something is not known.” There are several reasons why something may not be known, and — taking a statistical perspective — we will discuss two types of uncertainty called aleatory (sometimes referred to as aleatoric) and epistemic uncertainty.

Aleatory uncertainty relates to an objective or physical concept of uncertainty — it is a type of uncertainty that is intrinsic to the data-generating process. Since aleatory uncertainty has to do with an intrinsic quality of the data, we presume it cannot be decreased by collecting more data; that is, it is irreducible.

Aleatory uncertainty can be explained best with a simple example: Suppose we have a coin which has some positive probability of being heads or tails. Then, even if the coin is biased, we cannot predict — with certainty — what the next toss will be, regardless of how many observations we make. (For instance, if the coin is biased such that heads turn up with probability 0.9, we might reasonably guess that heads will show up in the next toss, but we cannot be certain that it will happen.)

Epistemic uncertainty relates to a subjective or personal concept of uncertainty — it is a type of uncertainty due to knowledge or ignorance of the true data-generating process. Since this type of uncertainty has to do with knowledge, we presume that it can be decreased (for example, when more data has been collected and used for training); that is, it is reducible.

Epistemic uncertainty can be explained with a regression example. Suppose we are fitting a linear regression model and we have independent variables x between -1 and 1 and corresponding dependent variables y for all x. Suppose we chose a linear model because we believe that, when x is between -1 and 1, the relationship is linear. We don’t, however, know what happens when a test sample x* is far outside this range; say, at x* = 100. So, in this scenario, there is uncertainty about the model specification (for example, the true function may be quadratic), and there is uncertainty because the model hasn’t seen data in the range of the test sample. These uncertainties can be bundled into uncertainty regarding knowledge of the true data-generating distribution, which is epistemic uncertainty.

The terms aleatory and epistemic, with regard to probability and uncertainty, seem to have been brought into the modern lexicon by Ian Hacking in his book “The Emergence of Probability,” which discusses the history of probability from 1600–1750. The terms are not clear to the uninitiated reader, but their definitions are related to the deepest question in the foundations of probability and statistics: What does probability mean? If you are familiar with the terms frequentist and Bayesian, then you will see the respective relationship between aleatory (objective) and epistemic (subjective) uncertainty. I’m not about to solve this philosophical issue in this blog post, but know that the definitions of aleatory and epistemic uncertainty are nuanced, and what falls into which category is debatable. For a more comprehensive (but still applied) review of these terms, take a look at the article: “Aleatory or Epistemic? Does it matter?”

Why is it important to distinguish between aleatory and epistemic uncertainty? Suppose we are developing a self-driving car, and we take a prototype that was trained on normal roads and have it drive through the Monza racing track, which has steeply banked turns.

Fig. 1: Banked turns on the Monza Racing Track

Since the car hasn’t seen this situation before, we would expect the image-segmentation DNN in the self-driving car, for example, to be uncertain because it has never seen the sky appear nearly beside the ground. In this case, the uncertainty would be classified as epistemic because the DNN doesn’t have knowledge of roads like this.

Suppose instead that we take the same self-driving car for a drive on a rainy day; assume that the DNN has been trained on lots of rainy-day conditions. In this situation, there is more uncertainty about objects on the road simply due to lower visibility. In this case, the uncertainty would be classified as aleatory because there is inherently more randomness in the data.

These two situations should be dealt with differently. On the race track, the uncertainty could tell the developers that they need to gather a particular type of training data to make the model more robust, or the car could try to safely maneuver to a location where it can hand off control to the driver. In the rainy-day situation, the uncertainty could alert the system to simply slow down or enable certain safety features.

Estimating uncertainty in DNNs

There has been a cornucopia of proposed methods to estimate uncertainty in DNNs in recent years. Generally, uncertainty estimation is formulated in the context of Bayesian statistics. In a standard DNN for classification, we are implicitly training a discriminative model where we obtain maximum-likelihood estimates of the neural network weights (depending on the loss function chosen to train the network). This point estimate of the network weights is not amenable to understanding what the model knows and does not know. If we instead find a distribution over the weights, as opposed to the point estimate, we can sample network weights with which we can compute corresponding outputs.

Intuitively, this sampling of network weights is like creating an ensemble of networks to do the task: We sample a set of “experts” to make a prediction. If the experts are inconsistent, there is high epistemic uncertainty. If the experts think it is too difficult to make an accurate prediction, there is high aleatory uncertainty.

In this article, we’ll take a look at a popular and easy-to-implement method to estimate uncertainty in DNNs by Yarin Gal and Zoubin Ghahramani. They showed that dropout can be used to learn an approximate distribution over the weights of a DNN (as previously discussed). Then, during prediction, dropout is used to sample weights from this fitted approximate distribution — akin to creating the ensemble of experts.
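As a rough sketch of how this sampling can be implemented (the helper names and structure are our own, not Gal and Ghahramani’s code), we can leave only the dropout layers stochastic at test time and run T forward passes:

```python
import torch

def enable_mc_dropout(model):
    """Keep only the dropout layers stochastic at test time;
    everything else (e.g., batch norm) stays in eval mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

@torch.no_grad()
def mc_samples(model, x, T=50):
    """Draw T stochastic forward passes; each pass implicitly samples
    a different set of network weights via dropout (this sketch assumes
    the model returns a single tensor)."""
    enable_mc_dropout(model)
    return torch.stack([model(x) for _ in range(T)])  # (T, batch, ...)
```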

Epistemic uncertainty is estimated by taking the sample variance of the predictions from the sampled weights. The intuition behind relating sample variance to epistemic uncertainty is that the sample variance will be low when the model predicts nearly identical outputs, and it will be high when the model makes inconsistent predictions; this is akin to when the set of experts consistently makes a prediction and when they do not, respectively.

Simultaneously, aleatory uncertainty is estimated by modifying the DNN to have a second output and by using a modified loss function. Aleatory uncertainty will correspond to the estimated variance of the output. This predicted variance has to do with an intrinsic quality of the data, which is why it is related to aleatory uncertainty; this is akin to when the set of experts judges the situation too difficult to make a prediction.

Altogether, the final network structure is something like what is shown in Fig. 2. There is an input x which is fed to a DNN with dropout after every layer (dropout after every layer is what was originally specified, but, in practice, dropout after every layer often makes training too difficult). The output of this DNN is an estimated target ŷ and an estimated variance or scale parameter σ̂.

Fig. 2: Example of a DNN architecture with the capability to estimate aleatory and epistemic uncertainty
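A minimal sketch of such a two-headed network in PyTorch might look like the following (the layer sizes and dropout rate are our assumptions, and we predict log σ² rather than σ², which is a common numerical-stability trick rather than a requirement of the method):

```python
import torch.nn as nn

class TwoHeadedMLP(nn.Module):
    """Toy regression network in the style of Fig. 2: dropout after
    each hidden layer, one head for the target and one for the
    log-variance."""
    def __init__(self, in_dim=1, hidden=64, p=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
        )
        self.y_head = nn.Linear(hidden, 1)       # estimated target ŷ
        self.logvar_head = nn.Linear(hidden, 1)  # estimated log σ̂²

    def forward(self, x):
        h = self.body(x)
        return self.y_head(h), self.logvar_head(h)
```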

If the network is being trained for a regression task, this DNN is trained with a loss function like:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{(y_i - \hat{y}_i)^2}{2\hat{\sigma}_i^2} + \frac{1}{2}\log \hat{\sigma}_i^2\right]$$

or

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{|y_i - \hat{y}_i|}{\hat{\sigma}_i} + \log \hat{\sigma}_i\right]$$

The first loss function shown above is an MSE variant with uncertainty, whereas the second is an L1 variant. These are derived by assuming a Gaussian and a Laplace distribution for the likelihood, respectively, where each component is independent and the variance (or scale parameter) is estimated and fitted by the network.

As mentioned above, these loss functions have mathematical derivations, but we can intuit why this variance parameter captures a type of uncertainty: the variance parameter provides a trade-off between the variance penalty and the MSE or L1 loss term. If the DNN can easily estimate the true value of the target (that is, get ŷ close to the true y), then the DNN should estimate a low variance for that component so as to minimize the loss. If, however, the DNN cannot estimate the true value of the target (for example, there is a low signal-to-noise ratio), then the network can minimize the loss by estimating a high variance. This reduces the MSE or L1 loss term because that term is divided by the variance; however, the network should not always do this because the log-variance term penalizes high variance estimates.
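Here is a minimal sketch of both regression losses in PyTorch, assuming (as in the architecture sketch above) that the network outputs the log-variance or log-scale rather than the variance itself:

```python
import torch

def gaussian_nll_loss(y_hat, log_var, y):
    """MSE variant with uncertainty: squared error down-weighted by the
    predicted variance, plus a log-variance penalty that keeps the
    network from predicting huge variances everywhere."""
    inv_var = torch.exp(-log_var)
    return (0.5 * inv_var * (y - y_hat) ** 2 + 0.5 * log_var).mean()

def laplace_nll_loss(y_hat, log_scale, y):
    """L1 variant: absolute error scaled by the predicted scale
    parameter, plus the corresponding log-scale penalty."""
    return (torch.exp(-log_scale) * torch.abs(y - y_hat) + log_scale).mean()
```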

If the network is being trained for a classification (or segmentation) task, the loss would look something like this two-part loss function:

$$\hat{x}_{i,t} = \hat{y}_i + \hat{\sigma}_i\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I)$$

$$\mathcal{L} = -\sum_{i} \log \frac{1}{T}\sum_{t=1}^{T} \operatorname{softmax}(\hat{x}_{i,t})_{c_i}$$

where ŷᵢ denotes the logits for component i and cᵢ is its true class.

The intuition with this loss function is: when the DNN can easily estimate the right class of a component, the logit ŷ will be high for that class, and the DNN should estimate a low variance so as to minimize the added noise (so that all samples concentrate around the correct class). If, however, the DNN cannot easily estimate the class of the component, the logit will be low, and adding noise can, by chance, boost the correct class, which lowers the loss overall. (See pg. 41 of Alex Kendall’s thesis for more discussion of this loss function.)
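A sketch of this sampled-logit loss in PyTorch, following the construction described above (the tensor shapes and the per-class σ̂ are our assumptions):

```python
import math
import torch

def aleatoric_classification_loss(logits, sigma, targets, T=30):
    """Corrupt the logits with N(0, σ̂²) noise T times, then average
    the likelihood of the true class over the noise samples before
    taking the log (a Monte Carlo estimate of the expected likelihood)."""
    # logits, sigma: (batch, classes); targets: (batch,) class indices
    noise = torch.randn(T, *logits.shape, device=logits.device)
    noisy_logits = logits.unsqueeze(0) + sigma.unsqueeze(0) * noise
    log_probs = torch.log_softmax(noisy_logits, dim=-1)       # (T, B, C)
    idx = targets.view(1, -1, 1).expand(T, -1, 1)             # (T, B, 1)
    true_log_probs = log_probs.gather(-1, idx).squeeze(-1)    # (T, B)
    # log of the mean likelihood = logsumexp(log p) - log T
    return -(torch.logsumexp(true_log_probs, dim=0) - math.log(T)).mean()
```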

Finally, in testing, the network is sampled T times to create T estimated targets and T estimated variance outputs. These T outputs are then combined in various ways to make the final estimated target and uncertainty estimates as shown in Fig. 3.

Fig. 3: DNN outputs to final estimated target and uncertainty estimates

Mathematically, the final estimated target and the epistemic and aleatory uncertainty are (for the MSE regression variant):

$$\hat{y} = \frac{1}{T}\sum_{t=1}^{T}\hat{y}_t$$

$$\text{epistemic} = \frac{1}{T}\sum_{t=1}^{T}\hat{y}_t^2 - \left(\frac{1}{T}\sum_{t=1}^{T}\hat{y}_t\right)^2, \qquad \text{aleatory} = \frac{1}{T}\sum_{t=1}^{T}\hat{\sigma}_t^2$$
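Putting the pieces together, a sketch of the test-time procedure for the regression case (reusing the enable_mc_dropout helper and the two-headed model from the earlier snippets) might look like:

```python
import torch

@torch.no_grad()
def predict_with_uncertainty(model, x, T=50):
    """Run T MC-dropout passes and combine them: predictive mean,
    epistemic uncertainty (sample variance of the sampled targets),
    and aleatory uncertainty (mean of the predicted variances)."""
    enable_mc_dropout(model)  # from the earlier snippet
    y_samples, logvar_samples = [], []
    for _ in range(T):
        y_t, logvar_t = model(x)  # two-headed model: (ŷ_t, log σ̂²_t)
        y_samples.append(y_t)
        logvar_samples.append(logvar_t)
    y_samples = torch.stack(y_samples)                  # (T, batch, 1)
    y_mean = y_samples.mean(dim=0)
    epistemic = y_samples.var(dim=0, unbiased=False)    # variance of ŷ_t
    aleatory = torch.stack(logvar_samples).exp().mean(dim=0)  # mean σ̂²_t
    return y_mean, epistemic, aleatory
```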

