Why malicious inputs work and how to prevent them
In William Gibson’s 2010 novel Zero History, a character preparing to go on a high-stakes raid wears an oddly patterned t-shirt that renders him invisible to the monitoring CCTV cameras. It’s an idea many science fiction writers have explored, and it captivates audiences because it challenges the notion that AI is unbeatable and all-knowing. Someone can fool the algorithm with a simple trick? It’s a fun idea in sci-fi, but surely it can’t happen with real machine learning algorithms. Or so we thought.
For better or for worse, machine learning algorithms can be tricked, intentionally or not, by slight changes to their inputs. In 2020, the cyber security firm McAfee showed that Mobileye — the car intelligence system used by Tesla and other auto manufacturers — could be fooled into accelerating 50 MPH over the speed limit simply by plastering a two-inch-wide strip of black tape across a speed limit sign.
Researchers from four universities, including the University of Washington and UC Berkeley, discovered that road sign recognition models were completely fooled when a bit of spray paint or a few stickers were added to stop signs — alterations that look completely natural and non-malicious.
Researchers at MIT 3D-printed a toy turtle with a texture specifically designed to make Google’s object detection system classify it as a rifle, regardless of the angle at which the turtle was viewed. One can imagine how catastrophic the result would be if such a system were deployed in public spaces to detect shooters and a child happened to be holding that textured toy turtle. Conversely, imagine a rifle textured so as not to look like one.
As machine learning takes an ever more important role in the world, these so-called ‘adversarial inputs’ — whether designed maliciously or not — are a serious problem in the real-world deployment of these algorithms. So when a state-of-the-art image recognition neural network misclassifies a panda as a gibbon after a seemingly invisible adversarial filter is applied…
…we can’t help but wonder why neural networks have this vulnerability.
Most adversarial inputs take advantage of something called a weak decision boundary. As a neural network is trained on thousands or even millions of examples, it continually adjusts the thresholds and rules it stores internally that dictate how it classifies each example. For example, consider a neural network trained to classify digits from 0 to 9: as it loops through countless training examples, its decision boundary becomes firmer and firmer in places where there is more ‘activity’. If a certain pixel has a value near 0 for half of the digits and a value near 1 for the other half, the network stores this useful information and uses it in its predictions.
But for pixels that remain relatively constant across all the images, like those along the perimeter, the information that pixel adds to the decision-making process is much less clear, since it takes almost the same value regardless of the digit. Yet occasionally there may be one or two images, like a 6 whose stroke happens to pass through that location, that do have a distinctive value there. This makes the pixel extremely sensitive to any change, and the decision boundary in that dimension is considered to be very weak.
Hence, switching that pixel to completely white on any image exploits this very sensitive part of the input and drastically increases the chance of the model labeling the image a ‘6’, since it recalls that a 6 was the only training example with a non-zero value at that pixel. In reality, however, that pixel value has nothing to do with a 6; it might as well have been an ‘8’ or a ‘3’. By random chance, 6 happened to be the only digit with a unique value at that pixel, but because the pixel was so sensitive, the model drew the wrong conclusion.
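To make the idea concrete, here is a minimal sketch (assuming scikit-learn and its bundled digits dataset, not any model from the examples above) that trains a simple classifier, finds the least-informative pixel, and checks whether maxing out that single pixel changes any predictions:

```python
# Minimal sketch: probe how sensitive a simple digit classifier is to a
# single low-variance pixel. Assumes scikit-learn; illustrative only.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X, y = digits.data / 16.0, digits.target  # 8x8 images flattened to 64 pixels

model = LogisticRegression(max_iter=2000).fit(X, y)

# Pick the pixel that varies least across the training set (a "quiet" pixel,
# typically near the image border) and push it to its maximum value.
quiet_pixel = np.argmin(X.var(axis=0))
X_perturbed = X.copy()
X_perturbed[:, quiet_pixel] = 1.0

flipped = (model.predict(X) != model.predict(X_perturbed)).sum()
print(f"Pixel {quiet_pixel}: {flipped} of {len(X)} predictions changed")
```

Depending on the learned weights, the count may be small or even zero; the point is simply that a pixel carrying almost no information can still receive a weight large enough to swing borderline predictions.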
Although this example exploits the freedom to change a single pixel, most modern adversarial inputs adjust all of the pixels a little bit, which allows for more complexity and subtlety; the reason it works, however, is the same as for the one-pixel change.
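One well-known way of constructing such an "adjust every pixel a little" perturbation is the fast gradient sign method (FGSM), the technique behind the panda-to-gibbon filter mentioned earlier. Below is a minimal sketch assuming TensorFlow 2.x, with a pretrained MobileNetV2 standing in for whichever classifier is being attacked; the epsilon value is an illustrative placeholder.

```python
# Minimal FGSM sketch: nudge every pixel slightly in the direction that
# increases the model's loss. MobileNetV2 is just a stand-in for any
# differentiable image classifier.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_perturbation(image, true_label, epsilon=0.01):
    """Return an adversarially perturbed copy of `image`, which is expected
    to have shape (1, 224, 224, 3) and already be preprocessed for
    MobileNetV2 (values in [-1, 1]). `epsilon` controls how visible the
    change is; small values keep it nearly invisible to humans."""
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_fn(tf.constant([true_label]), prediction)
    # The sign of the gradient tells us, for every pixel, which direction
    # increases the loss; step a tiny amount in that direction.
    signed_grad = tf.sign(tape.gradient(loss, image))
    return tf.clip_by_value(image + epsilon * signed_grad, -1.0, 1.0)
```

Because every pixel moves by at most epsilon, the perturbed image looks unchanged to a human while the loss, and often the predicted class, shifts dramatically.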
These sorts of weak decision boundaries will inevitably exist in any neural network because of the nature of datasets, which will naturally contain pixels that provide little information. However, studies of these malicious inputs show that a set of pixels that is sensitive in one neural network architecture does not necessarily show the same level of sensitivity in other architectures and datasets. Hence, malicious inputs are often constructed from the pixel sensitivities of ‘standard’ architectures, like VGG16 or ResNet.
Certain changes, however, like the two inches of tape mentioned above, are ‘universal’ in that they target sensitive locations regardless of model structure or dataset. Adversarial filters, like the one applied to the panda image above, take advantage of several weak boundaries and sensitive combinations of inputs at once. These types of alterations, intentionally malicious or not, are very dangerous for applications of machine learning such as self-driving cars.
What’s worse, physical adversarial inputs, or perturbations, aren’t really ‘hacking’ at all. Placing stickers on stop signs, or even viewing signs from different angles and perspectives, can cause sign recognition neural networks to misclassify them as yield or Speed Limit 45 signs; anyone who has spent time in a city has seen stickers on signs and other sources of natural physical perturbation dozens of times.
These sorts of attacks don’t just happen with images, though. Simple linear NLP classification models built on word vectors (e.g. logistic regression on a bag of words), which perform so well at identifying spam emails, are failing more often because spammers pad their email content with so-called ‘good words’ and deliberately misspell ‘bad words’, tricking the model into confidently classifying the message as non-spam. Other inputs deliberately exploit statistical weaknesses in clustering algorithms and attempt to distort the clustering process.
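A toy illustration of the ‘good words’ trick, assuming scikit-learn; the tiny corpus and word lists below are fabricated for the example, and a real spam filter would of course be trained on far more data.

```python
# Toy sketch of the "good words" attack on a bag-of-words spam filter.
# Assumes scikit-learn; the corpus is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "win free money now claim prize",          # spam
    "cheap pills free offer click now",        # spam
    "meeting agenda attached see you monday",  # ham
    "quarterly report and project timeline",   # ham
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

spam = "win free money now"
# Append "good words" the model associates with ham and misspell a
# "bad word" so its feature disappears from the vocabulary entirely.
evasive = spam.replace("money", "m0ney") + " meeting agenda report project timeline"

print(clf.predict_proba([spam])[0][1])     # high spam probability
print(clf.predict_proba([evasive])[0][1])  # noticeably lower spam probability
```

The misspelled word contributes nothing to the spam score because it never appeared in training, while the padded ‘good words’ pull the score toward the non-spam class.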
So how can machine learning engineers secure their models against adversarial inputs that could lead to a disastrous outcome?
The simplest brute-force method is adversarial training, in which the model is trained on all sorts of possible perturbations and hence becomes robust to them. One way to achieve this is with data augmentation, like the data generator found in Keras — these augmenters can flip an image, distort it around a point, rotate it, change the brightness, and so on. Other forms of augmentation might randomly scatter noise masks or randomly apply known adversarial filters over images. This augmentation can strengthen the decision boundaries around sensitive pixels.
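A minimal sketch of such an augmentation pipeline, assuming tf.keras; the noise-injection function and its parameters are illustrative choices rather than anything prescribed above.

```python
# Sketch of augmentation-based adversarial training with Keras'
# ImageDataGenerator. The random noise mask stands in for "known adversarial
# filters"; its strength (0.05) is an arbitrary illustrative value.
import numpy as np
import tensorflow as tf

def add_random_noise(image):
    """Add a small random noise mask to roughly half of the images."""
    if np.random.rand() < 0.5:
        image = image + np.random.normal(0.0, 0.05, size=image.shape)
    return np.clip(image, 0.0, 1.0)

augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,       # small rotations
    width_shift_range=0.1,   # horizontal jitter
    height_shift_range=0.1,  # vertical jitter
    zoom_range=0.1,          # mild distortion around the centre
    horizontal_flip=True,
    preprocessing_function=add_random_noise,
)

# Assuming x_train holds float images scaled to [0, 1] and y_train holds the
# labels, any compiled Keras classifier can be trained on the augmented
# stream so that it sees many perturbed variants of every example:
# model.fit(augmenter.flow(x_train, y_train, batch_size=32), epochs=10)
```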
Sometimes adversarial training is enough, in that it covers all the possible scenarios a machine learning model will encounter. The issue with this method, however, is that the model is explicitly told to be robust to each randomly generated issue, and hence has difficulty generalizing to new ones, like a uniquely designed sticker.
An alternative solution is defensive distillation, a strategy in which the model is trained to predict probabilities instead of hard classes. For example, if a neural network were trained to categorize cats vs. dogs, the metric would not be accuracy, as in how many times the class was predicted correctly, but some function of how far the predicted probability was from a ground-truth probability (the label). These probabilities may be supplied by human annotators or by an earlier model trained on the same task using class labels.
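A minimal sketch of the distillation step, assuming tf.keras: a previously trained ‘teacher’ network supplies soft probabilities, and a ‘student’ network (the model that will actually be deployed) is trained to match them. The temperature value and the small example architecture are illustrative choices, not anything specified above.

```python
# Sketch of defensive distillation with tf.keras. A trained "teacher" model
# produces softened probabilities at temperature T; the "student" is then
# trained against those probabilities instead of hard class labels.
import tensorflow as tf

T = 20.0  # temperature; higher values give smoother probability targets

def soft_targets(teacher_logits, temperature=T):
    """Convert teacher logits into softened probability labels."""
    return tf.nn.softmax(teacher_logits / temperature)

def build_student(num_classes=10):
    """Small example classifier that outputs logits."""
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes),  # logits, no softmax here
    ])

def distillation_loss(soft_labels, student_logits, temperature=T):
    """Cross-entropy between the teacher's soft labels and the student's
    temperature-scaled predictions."""
    student_probs = tf.nn.softmax(student_logits / temperature)
    return tf.keras.losses.categorical_crossentropy(soft_labels, student_probs)

# Assuming `teacher` is a trained model that outputs logits and x_train is
# the training set, training the student might look like:
# y_soft = soft_targets(teacher.predict(x_train))
# student = build_student()
# student.compile(optimizer="adam", loss=distillation_loss)
# student.fit(x_train, y_soft, epochs=10)
```

Because the student is graded on how close its probabilities are to the teacher's soft targets rather than on hard class hits, its output varies more gradually as the input changes.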
The result of defensive distillation is a model with much smoother loss landscapes in the directions perturbations attempt to exploit (decision boundaries that are sensitive because the model is narrowly torn between two classes). This makes it difficult to discover input alterations that lead to an incorrect categorization. The method was originally created to train smaller models to imitate larger ones (model compression) for computational savings, but it has been shown to work well at preventing adversarial inputs.
Adversarial inputs and weak decision boundaries have also been observed in animal brains in zoological studies, and even in human brains in the form of visual tricks. Machine learning is already, or soon will be, responsible for millions of lives through surveillance systems, self-driving cars, automated airplanes, and missile detonation systems. We can’t let it be fooled by simple perturbations, and as AI’s presence grows, handling adversarial inputs needs to be at the forefront of every machine learning engineer’s mind.